<?xml version="1.0" encoding="UTF-8" standalone="no"?><rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" version="2.0">

<channel>
	<title>AWS Big Data Blog</title>
	<atom:link href="https://aws.amazon.com/blogs/big-data/feed/" rel="self" type="application/rss+xml"/>
	<link>https://aws.amazon.com/blogs/big-data/</link>
	<description>Official Big Data Blog of Amazon Web Services</description>
	<lastBuildDate>Fri, 12 Jun 2026 16:32:19 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	
	<item>
		<title>Access Amazon S3 data files directly using AWS Lake Formation permissions</title>
		<link>https://aws.amazon.com/blogs/big-data/access-amazon-s3-data-files-directly-using-aws-lake-formation-permissions/</link>
		
		<dc:creator><![CDATA[Aarthi Srinivasan]]></dc:creator>
		<pubDate>Fri, 12 Jun 2026 16:32:19 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon Simple Storage Service (S3)]]></category>
		<category><![CDATA[Announcements]]></category>
		<category><![CDATA[AWS Lake Formation]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">8c5a717753d95f5e7fb5827a9fcf1b8c44c3af7c</guid>

					<description>In this post, we demonstrate reading from and writing to Lake Formation-managed S3 locations using Apache Spark jobs from EMR. Lake Formation credential vending for S3 location access is available in EMR release label 7.13 and later, Boto3 1.42.29 and later, AWS Java SDK 2.41.32 and later, and AWS Command Line Interface (AWS CLI) version 2.33.1 and later.</description>
										<content:encoded>&lt;p&gt;Data scientists and ML engineers often need to access raw data files in &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener"&gt;Amazon Simple Storage Service&lt;/a&gt; (Amazon S3) for machine learning training, data exploration, and generative AI workflows. However, when table-level access is governed by &lt;a href="https://aws.amazon.com/lake-formation/" target="_blank" rel="noopener"&gt;AWS Lake Formation&lt;/a&gt;, accessing the underlying S3 files has required maintaining separate permission mechanisms. S3 bucket policies or &lt;a href="https://aws.amazon.com/s/idc-server-side-test/awswt-936-new-template-iam/control-awswt-936-/" target="_blank" rel="noopener"&gt;AWS Identity and Access Management&lt;/a&gt; (IAM) role policies create operational overhead and risk of permission drift.&lt;/p&gt; 
&lt;p&gt;Lake Formation now supports direct access to S3 data file locations for tables whose permissions it manages. Previously, data scientists with Lake Formation permissions on &lt;a href="https://aws.amazon.com/glue/" target="_blank" rel="noopener"&gt;AWS Glue&lt;/a&gt; Data Catalog tables could query them using &lt;code&gt;spark.sql()&lt;/code&gt;. Now, they can also read and write the underlying S3 data files using &lt;code&gt;spark.read.parquet()&lt;/code&gt; or &lt;code&gt;spark.read.csv()&lt;/code&gt; from &lt;a href="https://aws.amazon.com/emr/" target="_blank" rel="noopener"&gt;Amazon EMR&lt;/a&gt; Spark jobs, &lt;a href="https://aws.amazon.com/sagemaker/unified-studio/" target="_blank" rel="noopener"&gt;Amazon SageMaker Unified Studio&lt;/a&gt; notebooks with EMR compute, and custom applications. All access is governed by the same Lake Formation permissions.&lt;/p&gt; 
&lt;p&gt;This capability is powered by the new &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/APIReference/API_GetTemporaryDataLocationCredentials.html" target="_blank" rel="noopener"&gt;GetTemporaryDataLocationCredentials()&lt;/a&gt; API, which vends temporary credentials scoped to registered S3 locations when callers have appropriate Lake Formation permissions on the corresponding Data Catalog tables. This eliminates the need to manage separate S3 bucket policies for file-level access while maintaining fine-grained access control in Lake Formation for table-based access. It enables your data scientists to explore S3 datasets securely, accelerate machine learning pipelines, and build generative AI workflows without compromising governance.&lt;/p&gt; 
&lt;p&gt;In this post, we demonstrate reading from and writing to Lake Formation-managed S3 locations using Apache Spark jobs from EMR. Lake Formation credential vending for S3 location access is available in EMR release label 7.13 and later, Boto3 1.42.29 and later, AWS Java SDK 2.41.32 and later, and AWS Command Line Interface (AWS CLI) version 2.33.1 and later.&lt;/p&gt; 
&lt;h2&gt;Key use cases for Lake Formation permissions to S3 locations&lt;/h2&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Unified permissions for Analytics and Machine Learning pipelines&lt;/strong&gt; – Data scientists can access both structured tables through SQL queries and the underlying data files through programmatic APIs for machine learning and AI workloads. They can use the tools of their choice. For example, they can use Amazon Athena for SQL analytics with the table names while reading and writing the underlying files in a SageMaker notebook or Spark application with &lt;code&gt;spark.read.parquet("s3://bucket/database_path/table_files/")&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Enable AI-ready data lakes&lt;/strong&gt; – Machine learning pipelines can read training data directly from governed data lakes. Generative AI applications can access foundation model training datasets, and data exploration workflows can use native file APIs, all while maintaining centralized governance and compliance.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Reduced operational complexity&lt;/strong&gt; – Operations teams don’t need to maintain separate permission policies – one in Lake Formation for table access and another in S3 bucket policies or AWS Identity and Access Management (IAM) roles for file access. This reduces the risk of permission mismatches and avoids inconsistent access control.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Unified audit capability&lt;/strong&gt; – Auditors do not need to examine multiple log sources, such as S3 access logs and &lt;a href="https://aws.amazon.com/cloudtrail/" target="_blank" rel="noopener"&gt;AWS CloudTrail&lt;/a&gt; events from different services, to understand who accessed what data and when. With this feature, you get a unified CloudTrail audit trail showing both table access through SQL engines and file access through direct APIs, with each access event linked to the Lake Formation permission grant.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;What customers are saying&lt;/h2&gt; 
&lt;p&gt;&lt;em&gt;“Through our close collaboration with AWS, Lake Formation’s new S3 location-based permissions have transformed how we manage data governance at Intuit. By unifying two separate access mechanisms for the same data into one unified permission model, we’ve dramatically reduced complexity and streamlined our auditing process. This is exactly the kind of simplification that lets our teams move faster without compromising security, ensuring we maintain the strict compliance and governance standards our regulators expect.”&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;— Tapan Upadhyay, Group Engineering Manager, Intuit&lt;/p&gt; 
&lt;h2&gt;Lake Formation Credential Vending Plugin for AWS SDK v2 for Java&lt;/h2&gt; 
&lt;p&gt;Lake Formation has made available a specialized library, the &lt;a href="https://github.com/aws/aws-lakeformation-accessgrants-plugin-java-v2" target="_blank" rel="noopener"&gt;AWS Lake Formation Credential Vending Plugin for AWS SDK V2 for Java&lt;/a&gt;. The Java plugin intercepts S3 requests for data, checks Lake Formation permissions for the requested location, and provides temporary scoped credentials to the client if permissions are granted in Lake Formation. If the S3 location access permissions are not managed by Lake Formation, the plugin checks for access in &lt;a href="https://aws.amazon.com/s3/features/access-grants/" target="_blank" rel="noopener"&gt;Amazon S3 Access Grants&lt;/a&gt; and lastly falls back to IAM permissions. The plugin is supported independently of Spark and comes as an enhancement to EMR Spark Full Table Access (FTA) mode in EMR 7.13 and later. The plugin is integrated at the S3A level. Therefore, any client of S3A can enable it by setting the following S3A configurations, in addition to the &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/lake-formation-unfiltered-ec2-access.html" target="_blank" rel="noopener"&gt;EMR Lake Formation Full Table Access (FTA) configuration&lt;/a&gt;:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;fs.s3a.lakeformation.access.grants.enabled = true
fs.s3a.lakeformation.access.grants.fallback.to.iam = true&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
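&lt;p&gt;As a minimal sketch of how these two settings might be passed to a Spark job, the following Python helper turns them into &lt;code&gt;spark-submit&lt;/code&gt; arguments. It assumes the standard Spark convention of prefixing Hadoop properties with &lt;code&gt;spark.hadoop.&lt;/code&gt;; the helper name is hypothetical, and your cluster may instead set these properties through an EMR configuration classification.&lt;/p&gt; 

```python
# Sketch: turn the two S3A settings above into spark-submit arguments.
# Assumption: the "spark.hadoop." prefix is used to hand Hadoop/S3A properties
# to Spark. The helper name is hypothetical and not part of any AWS library.

S3A_LAKE_FORMATION_SETTINGS = {
    "fs.s3a.lakeformation.access.grants.enabled": "true",
    "fs.s3a.lakeformation.access.grants.fallback.to.iam": "true",
}


def build_spark_submit_conf(settings):
    """Return a flat list of --conf arguments, sorted by property name."""
    args = []
    for key, value in sorted(settings.items()):
        args.extend(["--conf", f"spark.hadoop.{key}={value}"])
    return args


if __name__ == "__main__":
    # Print the arguments so they can be pasted into a spark-submit command.
    print(" ".join(build_spark_submit_conf(S3A_LAKE_FORMATION_SETTINGS)))
```

&lt;p&gt;Keeping the settings in one dictionary makes it straightforward to reuse the same values for a step definition or a notebook session.&lt;/p&gt; 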
&lt;p&gt;With the Java plugin, you can govern data lake resources in your custom applications with Lake Formation permissions. You can manage fine-grained access for users who need restricted access to Data Catalog tables, while providing direct S3 object-level access for the use cases that require it.&lt;/p&gt; 
&lt;p&gt;Note: (1) The principal that will be accessing direct S3 locations of the tables will require full table access. That is, Lake Formation SELECT permission on all columns and rows of the table is required. (2) The Spark cluster needs FTA configuration. (3) Currently, Apache Iceberg table format is not supported with this plugin.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;A financial services company runs daily ETL jobs using Spark in EMR. They process raw transaction records in S3 and store the processed records in another S3 location. The transformed Parquet data is registered with Lake Formation and cataloged as a table in Data Catalog. The ETL job will have direct IAM access to the raw data location, while it uses Lake Formation permissions to write to and read from the curated table location. Downstream, a data-analyst role will query the curated table, with restricted column access. The solution is shown in Figure 1.&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 1 – Architecture shows EMR Spark writing curated records to the S3 location of a table using Lake Formation permissions while Data-Analyst queries the same table with Lake Formation fine grained access control in Athena.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/20/BDB-5860-1.png" alt="Architecture diagram showing EMR Spark writing curated records to the S3 location of a table using Lake Formation permissions while Data-Analyst queries the same table with Lake Formation fine-grained access control in Athena" width="600"&gt;&lt;/p&gt; 
&lt;h3&gt;Prerequisites&lt;/h3&gt; 
&lt;p&gt;To get started exploring this feature, we recommend you have the following setup.&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;An AWS account with a Lake Formation administrator configured. Refer to &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/permissions-reference.html#persona-dl-admin" target="_blank" rel="noopener"&gt;Data lake administrator permissions&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/initial-lf-config.html" target="_blank" rel="noopener"&gt;Set up AWS Lake Formation&lt;/a&gt;. You can also refer to the blog &lt;a href="https://aws.amazon.com/blogs/big-data/simplify-data-access-for-your-enterprise-using-amazon-sagemaker-lakehouse/" target="_blank" rel="noopener"&gt;Simplify data access for your enterprise using Amazon SageMaker Lakehouse&lt;/a&gt; for the Lake Formation admin setup in your AWS account. For ease of following along, you can use an IAM administrator role, registered as a Lake Formation administrator role.&lt;/li&gt; 
 &lt;li&gt;An S3 bucket to host the sample table data.&lt;/li&gt; 
 &lt;li&gt;An IAM role to register the preceding table data in your S3 bucket with Lake Formation. Refer to the permissions policy and trust policy for this role in &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/registration-role.html" target="_blank" rel="noopener"&gt;Requirements for roles used to register locations&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;To run the Spark code in EMR, you can use either SageMaker Unified Studio with EMR compute or an EMR cluster launched from the EMR console. With a SageMaker Unified Studio domain and project, the Lake Formation permissions for the table location are granted to the project execution role. In this post, we use an EMR on EC2 cluster and a runtime role to submit the Spark script as a step to the cluster. For instructions to launch an EMR on EC2 cluster with Lake Formation full table access enabled, refer to &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/lake-formation-unfiltered-ec2-access.html" target="_blank" rel="noopener"&gt;Lake Formation full table access for Amazon EMR on EC2&lt;/a&gt; and &lt;a href="https://aws.amazon.com/blogs/big-data/introducing-runtime-roles-for-amazon-emr-steps-use-iam-roles-and-aws-lake-formation-for-access-control-with-amazon-emr/" target="_blank" rel="noopener"&gt;Introducing runtime roles for Amazon EMR steps: Use IAM roles and AWS Lake Formation for access control with Amazon EMR&lt;/a&gt;. The fine-grained access control (FGAC) option is not supported for Spark on EMR with this feature, because an S3 location permission grants access to the full file path.&lt;/li&gt; 
 &lt;li&gt;An IAM role Data-Analyst, with permissions as detailed in &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/permissions-reference.html#persona-user" target="_blank" rel="noopener"&gt;Data analyst permissions&lt;/a&gt;.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Solution walkthrough&lt;/h2&gt; 
&lt;p&gt;First, we set up S3, the sample database, the table, and the data. We add a raw dataset to one S3 location and create a table with Parquet data in another S3 location that represents the curated dataset for downstream consumption. We then register the table data location with Lake Formation and grant permissions to the EMR runtime role and the Data-Analyst role.&lt;/p&gt; 
&lt;p&gt;Your S3 bucket will have the following structure.&lt;/p&gt; 
&lt;p&gt;Raw data – &lt;code&gt;s3://&amp;lt;your-bucket-name&amp;gt;/raw/transactions/dt=2024-03-21/&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;Processed data for table – &lt;code&gt;s3://&amp;lt;your-bucket-name&amp;gt;/processed/transactions/&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;Spark script – &lt;code&gt;s3://&amp;lt;your-bucket-name&amp;gt;/scripts/&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;Logs for the EMR cluster – &lt;code&gt;s3://&amp;lt;your-bucket-name&amp;gt;/logs/&lt;/code&gt;&lt;/p&gt; 
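&lt;p&gt;If you prefer to derive these prefixes in code rather than typing them by hand, a small helper along the following lines may help. This is only a sketch: the function name and the example bucket name are hypothetical, and the &lt;code&gt;dt=&lt;/code&gt; folder simply mirrors the raw data path shown above.&lt;/p&gt; 

```python
from datetime import date


def build_layout(bucket, run_date):
    """Return the four S3 prefixes used in this walkthrough for one bucket."""
    base = f"s3://{bucket}"
    return {
        # Raw input, one folder per day (dt=YYYY-MM-DD), accessed with IAM.
        "raw": f"{base}/raw/transactions/dt={run_date.isoformat()}/",
        # Curated table location, registered with Lake Formation.
        "processed": f"{base}/processed/transactions/",
        # PySpark script submitted as an EMR step.
        "scripts": f"{base}/scripts/",
        # EMR cluster logs.
        "logs": f"{base}/logs/",
    }


if __name__ == "__main__":
    # "example-bucket" is a placeholder, not a real bucket.
    for name, prefix in build_layout("example-bucket", date(2024, 3, 21)).items():
        print(f"{name}: {prefix}")
```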
&lt;h3&gt;Step 1 – Create a parquet table in Data Catalog&lt;/h3&gt; 
&lt;p&gt;From the Athena console query editor, create a table in Data Catalog.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;-- Create a database
CREATE DATABASE finance_db;

-- Create an external table pointing to the S3 location
CREATE EXTERNAL TABLE IF NOT EXISTS finance_db.transactions_processed (
    transaction_id STRING,
    merchant_name STRING,
    amount DECIMAL(18,2),
    currency STRING,
    account_number STRING,
    card_type STRING,
    status STRING,
    region STRING
)
PARTITIONED BY (transaction_date DATE)
STORED AS PARQUET
LOCATION 's3://&lt;your-bucket-name&gt;/processed/transactions/'
TBLPROPERTIES (
    'parquet.compress'='SNAPPY'
);&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3&gt;Step 2 – Register S3 location and grant table permission to IAM roles in Lake Formation&lt;/h3&gt; 
&lt;p&gt;2.1 Register the table data location &lt;code&gt;s3://&amp;lt;your-bucket-name&amp;gt;/processed/transactions/&lt;/code&gt; with Lake Formation in Lake Formation mode using the custom S3 registration IAM role. For details on how to register locations with Lake Formation, refer to &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/register-data-lake.html" target="_blank" rel="noopener"&gt;Adding an Amazon S3 location to your data lake&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;2.2 Grant DESCRIBE permission on the database &lt;code&gt;finance_db&lt;/code&gt; and ALL permission on the table &lt;code&gt;transactions_processed&lt;/code&gt; to your EMR runtime role.&lt;/p&gt; 
&lt;p&gt;2.3 Grant Data location permission to EMR runtime role on the curated table’s location. This is to allow writing to that location.&lt;/p&gt; 
&lt;p&gt;2.4 Grant DESCRIBE permission on the database &lt;code&gt;finance_db&lt;/code&gt; and SELECT permission on the table &lt;code&gt;transactions_processed&lt;/code&gt; to your Data-Analyst role. Exclude the columns &lt;code&gt;transaction_id&lt;/code&gt; and &lt;code&gt;account_number&lt;/code&gt; while granting SELECT permissions on the table to the Data-Analyst role.&lt;/p&gt; 
&lt;p&gt;For details on how to grant Lake Formation permissions, refer to &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/granting-database-permissions.html" target="_blank" rel="noopener"&gt;Granting database permissions using the named resource method&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/granting-table-permissions.html" target="_blank" rel="noopener"&gt;Granting table permissions using the named resource method&lt;/a&gt;, and &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/granting-location-permissions.html" target="_blank" rel="noopener"&gt;Granting data location permissions&lt;/a&gt;.&lt;/p&gt; 
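&lt;p&gt;If you would rather script Step 2 than use the console, the following sketch builds the request payloads for the Boto3 Lake Formation calls &lt;code&gt;register_resource&lt;/code&gt; and &lt;code&gt;grant_permissions&lt;/code&gt;. The helper names, role ARNs, and bucket name are placeholders introduced for illustration, and the S3 ARN form without a trailing slash is an assumption; adjust it to match how you register the location.&lt;/p&gt; 

```python
# Sketch of Step 2 as Boto3 request payloads. Helper names are hypothetical;
# role ARNs and the bucket name are supplied by the caller.

DATABASE = "finance_db"
TABLE = "transactions_processed"


def location_arn(bucket):
    """S3 ARN for the curated table prefix (assumed form, no trailing slash)."""
    return f"arn:aws:s3:::{bucket}/processed/transactions"


def register_request(bucket, registration_role_arn):
    """Keyword arguments for lakeformation.register_resource (step 2.1)."""
    return {
        "ResourceArn": location_arn(bucket),
        "UseServiceLinkedRole": False,
        "RoleArn": registration_role_arn,
    }


def grant(principal_arn, resource, permissions):
    """Keyword arguments for one lakeformation.grant_permissions call."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": resource,
        "Permissions": permissions,
    }


def grant_requests(bucket, emr_role_arn, analyst_role_arn):
    """Grants for steps 2.2 to 2.4, in order."""
    database = {"Database": {"Name": DATABASE}}
    table = {"Table": {"DatabaseName": DATABASE, "Name": TABLE}}
    location = {"DataLocation": {"ResourceArn": location_arn(bucket)}}
    restricted = {
        "TableWithColumns": {
            "DatabaseName": DATABASE,
            "Name": TABLE,
            # All columns except the two sensitive ones.
            "ColumnWildcard": {
                "ExcludedColumnNames": ["transaction_id", "account_number"]
            },
        }
    }
    return [
        grant(emr_role_arn, database, ["DESCRIBE"]),              # 2.2
        grant(emr_role_arn, table, ["ALL"]),                      # 2.2
        grant(emr_role_arn, location, ["DATA_LOCATION_ACCESS"]),  # 2.3
        grant(analyst_role_arn, database, ["DESCRIBE"]),          # 2.4
        grant(analyst_role_arn, restricted, ["SELECT"]),          # 2.4
    ]


def apply(client, bucket, registration_role_arn, emr_role_arn, analyst_role_arn):
    """Send the requests using a Boto3 Lake Formation client."""
    client.register_resource(**register_request(bucket, registration_role_arn))
    for request in grant_requests(bucket, emr_role_arn, analyst_role_arn):
        client.grant_permissions(**request)
```

&lt;p&gt;To run it, pass &lt;code&gt;boto3.client("lakeformation")&lt;/code&gt; as &lt;code&gt;client&lt;/code&gt; while signed in as a Lake Formation administrator. Building the payloads separately from sending them lets you review each grant before anything changes in your account.&lt;/p&gt; 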
&lt;h3&gt;Step 3 – Run ETL script in EMR&lt;/h3&gt; 
&lt;p&gt;3.1 Download the script &lt;a href="https://aws-blogs-artifacts-public.s3.us-east-1.amazonaws.com/artifacts/BDB-5860/bdb-5860-script.py" target="_blank" rel="noopener"&gt;bdb-5860-script.py&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;3.2 Edit the S3 bucket name placeholder in the script (&lt;code&gt;RAW_PATH&lt;/code&gt; and &lt;code&gt;TABLE_PATH&lt;/code&gt;) to your resource names and upload the script to your S3 path &lt;code&gt;s3://&amp;lt;your-bucket-name&amp;gt;/scripts/&lt;/code&gt;.&lt;/p&gt; 
&lt;p&gt;3.3 Make sure your EMR runtime role has access to the script location in its IAM policy permissions.&lt;/p&gt; 
&lt;p&gt;3.4 Submit and run the script as a step to the EMR cluster, following instructions at &lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-submit-step.html" target="_blank" rel="noopener"&gt;Add a Spark step&lt;/a&gt;.&lt;/p&gt; 
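&lt;p&gt;As an alternative to the console, step 3.4 can be sketched with the Boto3 EMR call &lt;code&gt;add_job_flow_steps&lt;/code&gt;, which accepts an &lt;code&gt;ExecutionRoleArn&lt;/code&gt; for the runtime role. The helper name, step name, cluster ID, and ARNs below are placeholders, and the &lt;code&gt;spark-submit&lt;/code&gt; arguments are a common default rather than the exact command used for this post.&lt;/p&gt; 

```python
# Sketch: request payload for submitting the script as an EMR step with a
# runtime role. The helper name and step name are hypothetical.


def build_step_request(cluster_id, runtime_role_arn, script_uri):
    """Keyword arguments for emr.add_job_flow_steps."""
    return {
        "JobFlowId": cluster_id,
        # The runtime role whose Lake Formation grants govern S3 access.
        "ExecutionRoleArn": runtime_role_arn,
        "Steps": [
            {
                "Name": "lake-formation-direct-s3-etl",
                "ActionOnFailure": "CONTINUE",
                "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                        "spark-submit",
                        "--deploy-mode",
                        "cluster",
                        script_uri,
                    ],
                },
            }
        ],
    }


def submit(client, cluster_id, runtime_role_arn, script_uri):
    """Submit the step with a Boto3 EMR client and return the new step IDs."""
    request = build_step_request(cluster_id, runtime_role_arn, script_uri)
    return client.add_job_flow_steps(**request)["StepIds"]
```

&lt;p&gt;Call &lt;code&gt;submit(boto3.client("emr"), ...)&lt;/code&gt; with your cluster ID, runtime role ARN, and the S3 URI of the uploaded &lt;code&gt;bdb-5860-script.py&lt;/code&gt;.&lt;/p&gt; 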
&lt;p&gt;&lt;em&gt;What does the script do?&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;It populates raw transaction records into a Spark data frame and writes them to the raw data location using the IAM permissions of the EMR runtime role. It then applies some transformations and writes the result from the data frame directly to the S3 location of the table that is registered with Lake Formation, using Spark’s native Parquet writer.&lt;/p&gt; 
&lt;p&gt;The following figure shows the stdout of the step.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/20/BDB-5860-2.png" alt="EMR step stdout showing successful Spark job execution with data written to the Lake Formation-managed S3 location" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;The Java plugin integrated into EMR 7.13 automatically handles access to the table’s data location registered with Lake Formation, so you don’t need to call the &lt;code&gt;GetTemporaryDataLocationCredentials()&lt;/code&gt; API yourself. In this example, the table data location &lt;code&gt;s3://&amp;lt;your-bucket-name&amp;gt;/processed/transactions/&lt;/code&gt; is registered with Lake Formation, and the EMR runtime role is granted ALL permissions on the table and data location permission on that location. With direct S3 location access in LakeFormation, the job can read from and write to the location by path using Spark data frames.&lt;/p&gt; 
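&lt;p&gt;The path-based read and write described here might look like the following PySpark sketch. It is not the script from this post: the function names are hypothetical, and the append mode and the &lt;code&gt;transaction_date&lt;/code&gt; partition column are assumptions based on the table definition in Step 1. The functions take an existing Spark session and data frame, so nothing Lake Formation-specific appears in the code.&lt;/p&gt; 

```python
# Sketch: read and write the curated location by S3 path with Spark's
# DataFrame API. Function names are hypothetical. The plugin on the cluster
# is expected to obtain credentials for the path, so the code only uses
# ordinary DataFrameReader and DataFrameWriter calls.


def write_curated(df, table_path):
    """Append a DataFrame to the table location with Spark's Parquet writer.

    Assumes df already has a transaction_date column to partition by.
    """
    df.write.mode("append").partitionBy("transaction_date").parquet(table_path)


def read_curated(spark, table_path):
    """Read the same location back by path instead of by table name."""
    return spark.read.parquet(table_path)
```

&lt;p&gt;In a job, you would call &lt;code&gt;write_curated(df, TABLE_PATH)&lt;/code&gt; after the transformations and &lt;code&gt;read_curated(spark, TABLE_PATH)&lt;/code&gt; to verify the result. Files written by path are not added to the Data Catalog as partitions by the write itself, so check how your job registers new partitions before querying the table.&lt;/p&gt; 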
&lt;h3&gt;Step 4 – Run query as Data-Analyst using Athena&lt;/h3&gt; 
&lt;p&gt;Log in as the Data-Analyst role to the Athena console. Run a select query on the table as follows.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;SELECT * FROM finance_db.transactions_processed WHERE status = 'DECLINED' AND transaction_date=DATE '2024-03-21';&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The Data-Analyst role should see all but two columns of the table.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/20/BDB-5860-3.png" alt="Athena query results showing the Data-Analyst role can access all columns except transaction_id and account_number" width="600"&gt;&lt;/p&gt; 
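&lt;p&gt;To make the expected result concrete, the following sketch computes which columns a &lt;code&gt;SELECT *&lt;/code&gt; should return for the Data-Analyst role, given the table definition from Step 1 and the two columns excluded in step 2.4. The function name is hypothetical, and the code only models the column filter; it does not call Athena or Lake Formation.&lt;/p&gt; 

```python
# Sketch: model the column-level filter applied to the Data-Analyst role.
# Column names come from the CREATE TABLE statement in Step 1, including the
# partition column.

TABLE_COLUMNS = [
    "transaction_id",
    "merchant_name",
    "amount",
    "currency",
    "account_number",
    "card_type",
    "status",
    "region",
    "transaction_date",
]

# Columns excluded from the SELECT grant in step 2.4.
EXCLUDED_FOR_ANALYST = {"transaction_id", "account_number"}


def visible_columns(columns, excluded):
    """Return the columns that remain after removing the excluded ones."""
    return [name for name in columns if name not in excluded]


if __name__ == "__main__":
    print(visible_columns(TABLE_COLUMNS, EXCLUDED_FOR_ANALYST))
```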
&lt;p&gt;With these steps complete, we’ve read from and written to direct S3 locations using Spark data frames with the syntax &lt;code&gt;s3://bucketname/prefix/&lt;/code&gt;, and accessed the same data using &lt;code&gt;database_name.table_name&lt;/code&gt; syntax with Lake Formation permissions. This shows fine-grained access at table level and coarse-grained access at the file path level.&lt;/p&gt; 
&lt;h3&gt;Clean up&lt;/h3&gt; 
&lt;p&gt;To avoid incurring costs, clean up the resources you created for this post.&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;Delete the Data Catalog database and tables. This removes the related Lake Formation permissions too. Remove the S3 bucket registration from Lake Formation.&lt;/li&gt; 
 &lt;li&gt;Delete the data files, logs, and the PySpark script of this post from your S3 bucket.&lt;/li&gt; 
 &lt;li&gt;Terminate the EMR cluster.&lt;/li&gt; 
&lt;/ol&gt; 
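&lt;p&gt;The clean-up steps can also be scripted. The following sketch lists the Boto3 operations involved (&lt;code&gt;delete_database&lt;/code&gt; in AWS Glue, &lt;code&gt;deregister_resource&lt;/code&gt; in Lake Formation, and &lt;code&gt;terminate_job_flows&lt;/code&gt; in EMR) as a plan you can inspect before running. The helper names are hypothetical, the S3 ARN form is an assumption that should match how you registered the location, and deleting the S3 objects is left out so that you can review the bucket contents first.&lt;/p&gt; 

```python
# Sketch: clean-up as an ordered plan of (service, operation, arguments).
# Helper names are hypothetical; these operations delete resources.


def cleanup_plan(bucket, cluster_id):
    """Return the clean-up calls for this walkthrough, in order."""
    return [
        # Deleting the database also removes its tables.
        ("glue", "delete_database", {"Name": "finance_db"}),
        # Remove the S3 location registration from Lake Formation.
        (
            "lakeformation",
            "deregister_resource",
            {"ResourceArn": f"arn:aws:s3:::{bucket}/processed/transactions"},
        ),
        # Stop the cluster to avoid further charges.
        ("emr", "terminate_job_flows", {"JobFlowIds": [cluster_id]}),
    ]


def run_cleanup(clients, plan):
    """Execute the plan with a dict of Boto3 clients keyed by service name."""
    for service, operation, kwargs in plan:
        getattr(clients[service], operation)(**kwargs)
```

&lt;p&gt;To run it, build &lt;code&gt;clients&lt;/code&gt; with &lt;code&gt;boto3.client("glue")&lt;/code&gt;, &lt;code&gt;boto3.client("lakeformation")&lt;/code&gt;, and &lt;code&gt;boto3.client("emr")&lt;/code&gt;, then call &lt;code&gt;run_cleanup(clients, cleanup_plan(bucket, cluster_id))&lt;/code&gt;.&lt;/p&gt; 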
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, we showed how to use Lake Formation’s direct S3 location access to read and write data files using Spark data frames from Amazon EMR, while maintaining unified governance through Lake Formation permissions. We walked through the &lt;code&gt;GetTemporaryDataLocationCredentials()&lt;/code&gt; API and the AWS Lake Formation Credential Vending Plugin for AWS SDK v2 for Java, which is integrated into EMR release labels 7.13 and later.&lt;/p&gt; 
&lt;p&gt;This capability unifies permission management for both fine-grained table-based access and direct S3 file path access in Lake Formation. Your data scientists can now use &lt;code&gt;spark.read.parquet()&lt;/code&gt; and &lt;code&gt;spark.write&lt;/code&gt; alongside &lt;code&gt;spark.sql()&lt;/code&gt;, governed by the same permissions, audited in the same CloudTrail logs, and managed from a single console.&lt;/p&gt; 
&lt;p&gt;To get started, launch an &lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-7130-release.html" target="_blank" rel="noopener"&gt;EMR 7.13 cluster&lt;/a&gt; and start exploring the feature. Here are some additional resources:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Lake Formation documentation on &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/accessing-s3-locations.html" target="_blank" rel="noopener"&gt;Accessing Amazon S3 locations&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;EMR documentation on &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/lake-formation-path-based-credential-vending.html" target="_blank" rel="noopener"&gt;S3 path-based access using Lake Formation for Amazon EMR Spark&lt;/a&gt;&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Acknowledgements: We would like to thank all the team members who worked to launch this feature successfully – Rajas Bhate, Akhil Yendluri, Kunal Parikh, Sharda Khubchandani, Dhananjay Badaya, Santhosh Padmanabhan, Nitin Agrawal and Sandeep Adwankar.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/20/BDB-5860-4.png" alt="Aarthi Srinivasan" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Aarthi Srinivasan&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/aarthis/" target="_blank" rel="noopener"&gt;Aarthi&lt;/a&gt; is a Senior Big Data Architect at Amazon Web Services (AWS). She works with AWS customers and partners to architect data lake solutions, enhance product features, and establish best practices for data governance.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/20/BDB-5860-5.jpg" alt="Archana Inapudi" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Archana Inapudi&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/inapudi/" target="_blank" rel="noopener"&gt;Archana&lt;/a&gt; is a Senior Solutions Architect at Amazon Web Services (AWS). She works with strategic enterprise customers to drive cloud data modernization, architect data lake and analytics solutions, and establish best practices for data governance and security. With over 15 years of experience in cloud, data engineering, and AI/ML, Archana is passionate about using technology to accelerate growth and deliver business outcomes.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/20/BDB-5860-6.png" alt="Srinivasan Krishnasamy" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Srinivasan Krishnasamy&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/srinikris/" target="_blank" rel="noopener"&gt;Srinivasan&lt;/a&gt; is a Principal Delivery Consultant at AWS with 25+ years of experience architecting data and analytics solutions at scale. He partners with enterprise customers to modernize data platforms, build robust data governance frameworks, and drive measurable business outcomes on AWS, using the full spectrum of data engineering, AI/ML, and generative AI. Outside of work, he enjoys hiking, swimming, and gardening.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/20/BDB-5860-7.png" alt="Anandkumar Kaliaperumal" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Anandkumar Kaliaperumal&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/anandkumarsk/" target="_blank" rel="noopener"&gt;Anandkumar&lt;/a&gt; is a Senior Delivery Consultant at AWS, bringing over 23 years of deep expertise in data and analytics. A specialist in architecting scalable data analytics, AI/ML, and generative AI solutions, he thrives on tackling complex data challenges spanning data engineering, analytics, machine learning, and generative AI workloads.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/20/BDB-5860-8.png" alt="Mitali Sheth" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Mitali Sheth&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/mitali-sheth/" target="_blank" rel="noopener"&gt;Mitali&lt;/a&gt; is a Streaming Data Engineer at Amazon Web Services (AWS) Professional Services. She works with strategic software customers to architect real-time analytics solutions, design event-driven architectures, and modernize streaming infrastructure using Amazon MSK, Amazon Managed Flink, AWS Glue, and AWS Lake Formation. She holds an M.S. in Computer Science from the University of Florida.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>Building AI shopping agent using Amazon Bedrock AgentCore Runtime and Amazon OpenSearch Service</title>
		<link>https://aws.amazon.com/blogs/big-data/building-ai-shopping-agent-using-amazon-bedrock-agentcore-runtime-and-amazon-opensearch-service/</link>
		
		<dc:creator><![CDATA[Omama Khurshid]]></dc:creator>
		<pubDate>Thu, 11 Jun 2026 15:27:16 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon Bedrock AgentCore]]></category>
		<category><![CDATA[Amazon OpenSearch Service]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">8e30a8ee2437fec0011d3ab2a7c3ce603dfd7e70</guid>

					<description>In this post, we explore how to build an online shopping AI agent. We focus on its architecture and implementation with Amazon OpenSearch Service, Amazon Bedrock AgentCore, and Strands Agents. Amazon Bedrock AgentCore is an agentic platform for deploying and operating those agents and tools securely at scale without managing infrastructure.</description>
										<content:encoded>&lt;p&gt;In this post, we explore how to build an online shopping AI agent. We focus on its architecture and implementation with &lt;a href="https://aws.amazon.com/opensearch-service/" target="_blank" rel="noopener"&gt;Amazon OpenSearch Service&lt;/a&gt;, &lt;a href="https://aws.amazon.com/bedrock/agentcore/" target="_blank" rel="noopener"&gt;Amazon Bedrock AgentCore&lt;/a&gt;, and &lt;a href="https://strandsagents.com/latest/" target="_blank" rel="noopener"&gt;Strands Agents&lt;/a&gt;. Amazon Bedrock AgentCore is an agentic platform for deploying and operating those agents and tools securely at scale without managing infrastructure. AgentCore Runtime is the secure, serverless runtime that hosts your Strands Agents and tools as containerized applications. Strands Agents is an open source SDK for building AI agents. In this SDK, an agent is defined by a model, tools, and a prompt. &lt;a href="https://strandsagents.com/latest/documentation/docs/user-guide/concepts/tools/" target="_blank" rel="noopener"&gt;Tools&lt;/a&gt; are callable functions that allow agents to perform actions beyond text generation, such as API calls, database queries, and file operations. The framework lets the model autonomously plan steps and invoke tools to complete tasks.&lt;/p&gt; 
&lt;p&gt;Today’s AI shopping assistants understand natural language, context, and shopping intent, creating a more human-like interaction. These assistants handle complex shopping requirements, such as “Find me a formal dress under $200 that’s appropriate for a summer wedding.” They maintain conversation history, process follow-up questions naturally, and provide personalized recommendations based on user preferences and past interactions. Customers can use visual search to upload images of items that they want, and the AI finds similar products across multiple retailers, matching styles and patterns. The goal is to provide instant, relevant, and personalized assistance at scale, creating a more efficient shopping journey for consumers worldwide.&lt;/p&gt; 
&lt;p&gt;AI agents combined with Retrieval Augmented Generation (RAG) on Amazon OpenSearch Service represent an evolution in conversational search. This integration builds AI agents on enriched catalogs, supporting context-aware and autonomous search experiences while maintaining accuracy and relevance through grounded responses.&lt;/p&gt; 
&lt;h2 id="solution-overview"&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;The following diagram illustrates the solution architecture of an AI-powered online shopping agent built using Strands Agents, Amazon Bedrock AgentCore Runtime, and Amazon OpenSearch Service. For simplicity, the diagram doesn’t show authentication and authorization. In a production setup, secure access to the backend by using mechanisms such as Amazon API Gateway, AWS Identity and Access Management (IAM) roles, or OAuth-based authentication.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/03/BDB-5606-1.png" alt="Architecture diagram showing an AI shopping agent: a user prompt flows from the front end to AgentCore Runtime, which routes the request to a Strands Retail Agent that calls a search tool against Amazon OpenSearch Service and an Amazon Bedrock LLM, then returns a natural-language response to the user." width="600"&gt;&lt;/p&gt; 
&lt;p&gt;The following is a walkthrough of the reference architecture:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;The user submits a question through the front-end application. AgentCore Runtime receives the request and routes it to the Strands Retail Agent.&lt;/li&gt; 
 &lt;li&gt;The Strands Agent processes the task and invokes the &lt;code&gt;search_product_catalog&lt;/code&gt; tool.&lt;/li&gt; 
 &lt;li&gt;OpenSearch Service performs semantic search and returns relevant product results.&lt;/li&gt; 
 &lt;li&gt;The Strands Agent invokes Amazon Bedrock large language models (LLMs) to generate a natural language response.&lt;/li&gt; 
 &lt;li&gt;The agent response is returned to the user through the front end.&lt;/li&gt; 
&lt;/ol&gt; 
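&lt;p&gt;The five steps above can be sketched in plain Python. This is a minimal illustration with stand-in functions, not the Strands Agents or AgentCore API: &lt;code&gt;fake_search&lt;/code&gt;, &lt;code&gt;fake_llm&lt;/code&gt;, and &lt;code&gt;handle_request&lt;/code&gt; are hypothetical names that stand in for the OpenSearch Service query, the Amazon Bedrock model call, and the runtime entry point.&lt;/p&gt;

```python
from typing import Callable, Dict, List


def fake_search(query: str) -> List[Dict[str, object]]:
    """Hypothetical stand-in for the search_product_catalog tool (steps 2-3)."""
    catalog = [
        {"title": "Mens Cotton Jacket", "price": 55.99},
        {"title": "Mens Casual Slim Fit", "price": 15.99},
    ]
    words = query.lower().split()
    return [p for p in catalog if any(w in str(p["title"]).lower() for w in words)]


def fake_llm(question: str, products: List[Dict[str, object]]) -> str:
    """Hypothetical stand-in for the LLM call that writes the answer (step 4)."""
    if not products:
        return f"No products matched '{question}'."
    lines = [f"{p['title']} (${p['price']})" for p in products]
    return "I found: " + "; ".join(lines)


def handle_request(prompt: str,
                   search: Callable = fake_search,
                   llm: Callable = fake_llm) -> str:
    """Steps 1-5: receive the prompt, call the tool, generate, and return."""
    products = search(prompt)      # tool call against the product catalog
    return llm(prompt, products)   # response grounded in the search results


if __name__ == "__main__":
    print(handle_request("jacket"))
```

&lt;p&gt;In the real solution, the model decides when to call the tool. The fixed call order here only shows how data moves between the components.&lt;/p&gt;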
&lt;h2 id="walkthrough"&gt;Walkthrough&lt;/h2&gt; 
&lt;p&gt;The following section walks you through how to build an online shopping AI agent.&lt;/p&gt; 
&lt;h3 id="prerequisites"&gt;Prerequisites&lt;/h3&gt; 
&lt;p&gt;To implement this solution, you need an &lt;a href="https://signin.aws.amazon.com/signup?request_type=register" target="_blank" rel="noopener"&gt;AWS account&lt;/a&gt;. You also need an OpenSearch Service domain with OpenSearch version 2.13 or later. You can use an existing domain or &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/createupdatedomains.html" target="_blank" rel="noopener"&gt;create a new domain&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;To use the vector search capabilities of OpenSearch Service with Strands Agents on &lt;a href="https://aws.amazon.com/bedrock/agentcore/" target="_blank" rel="noopener"&gt;AgentCore&lt;/a&gt;, you use &lt;a href="https://docs.opensearch.org/latest/ingest-pipelines/" target="_blank" rel="noopener"&gt;ingest pipelines&lt;/a&gt;. These pipelines apply built-in processors to pre-process your documents before they are indexed in OpenSearch Service.&lt;/p&gt; 
&lt;p&gt;You use the &lt;a href="https://docs.opensearch.org/latest/ingest-pipelines/processors/text-embedding/" target="_blank" rel="noopener"&gt;text_embedding processor&lt;/a&gt;, which relies on the &lt;a href="https://docs.opensearch.org/1.3/ml-commons-plugin/" target="_blank" rel="noopener"&gt;ML Commons plugin&lt;/a&gt; and a registered embedding model, in this case &lt;a href="https://aws.amazon.com/blogs/aws/amazon-nova-multimodal-embeddings-now-available-in-amazon-bedrock/" target="_blank" rel="noopener"&gt;Amazon Nova Multimodal Embeddings&lt;/a&gt; on Amazon Bedrock. OpenSearch Service uses the ML Commons plugin to generate vector embeddings for your data and uses the same model to convert incoming queries into vectors. Because documents and queries share one vector space, OpenSearch Service can match them by meaning rather than by exact keywords, which supports semantic search across your indexed content.&lt;/p&gt; 
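&lt;p&gt;To make the idea concrete, the following sketch ranks documents by cosine similarity between a query vector and document vectors. The three-dimensional vectors are made-up values for illustration only. A real embedding model returns much larger vectors (1024 dimensions in this post), and &lt;code&gt;rank&lt;/code&gt; is a hypothetical helper, not an OpenSearch API.&lt;/p&gt;

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank(query_vec: Sequence[float],
         doc_vecs: Dict[str, Sequence[float]]) -> List[str]:
    """Return document titles ordered from most to least similar."""
    scored = {title: cosine_similarity(query_vec, vec)
              for title, vec in doc_vecs.items()}
    return sorted(scored, key=scored.get, reverse=True)


# Made-up toy embeddings (assumption: not produced by any real model)
toy_docs = {
    "Mens Cotton Jacket": [0.9, 0.1, 0.0],
    "Chain Bracelet": [0.0, 0.2, 0.9],
    "Slim Fit T-Shirts": [0.6, 0.6, 0.1],
}
toy_query = [1.0, 0.2, 0.0]  # pretend this is the embedding of "jacket"

if __name__ == "__main__":
    print(rank(toy_query, toy_docs))
```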
&lt;p&gt;You extend your semantic search backend by adding an agent built with Strands Agents and deployed on Amazon Bedrock AgentCore.&lt;/p&gt; 
&lt;p&gt;The code samples in this post were tested with Python 3.11, so install Python 3.11 in your environment to run the Python scripts. You also need &lt;a href="http://nodejs.org"&gt;Node.js 18&lt;/a&gt; or later to use the AgentCore CLI. The scripts deploy resources into your AWS account, so make sure that your terminal has access to the necessary AWS credentials.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Install AgentCore CLI&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Install the AgentCore CLI globally using npm:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;npm install -g @aws/agentcore&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Python Dependencies&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;To deploy the agent, create a &lt;code&gt;requirements.txt&lt;/code&gt; file in your workspace with the following dependencies.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;boto3
uv
opensearch-py
requests-aws4auth
strands-agents
strands-agents-tools
bedrock-agentcore&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Run &lt;code&gt;pip install -r requirements.txt&lt;/code&gt; in your terminal to install the required dependencies. To avoid conflicts with other dependencies in your system, you can use a &lt;a href="https://docs.python.org/3/library/venv.html" target="_blank" rel="noopener"&gt;virtual environment&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;Now, walk through each step.&lt;/p&gt; 
&lt;h3 id="step-1-configure-iam-permissions"&gt;Step 1: Configure IAM permissions&lt;/h3&gt; 
&lt;p&gt;Complete the following steps to register the Nova Multimodal Embeddings model with OpenSearch Service and verify that your OpenSearch Service domain has permission to invoke the Amazon Bedrock API.&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;Go to the IAM console and &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create.html" target="_blank" rel="noopener"&gt;create a new role&lt;/a&gt; with a custom trust policy. Add the following trust policy. 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="language-json"&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Statement1",
      "Effect": "Allow",
      "Principal": {
        "Service": "opensearchservice.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;/li&gt; 
 &lt;li&gt;Skip adding a permission policy.&lt;/li&gt; 
 &lt;li&gt;Give your role a name and create it. For this post, we use &lt;strong&gt;OpenSearchBedrockEmbeddingRole&lt;/strong&gt; as the role name. OpenSearch Service uses this role to invoke the Nova Multimodal Embeddings model on Amazon Bedrock.&lt;/li&gt; 
 &lt;li&gt;On the &lt;strong&gt;Permissions&lt;/strong&gt; tab, attach an inline policy with the following permissions. For this post, we name this policy &lt;strong&gt;OpenSearchBedrockEmbeddingPolicy&lt;/strong&gt;. 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="language-json"&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Statement1",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeAgent",
        "bedrock:InvokeModel"
      ],
      "Resource": [
        "arn:aws:bedrock:us-east-1::foundation-model/*"
      ]
    }
  ]
}&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;/li&gt; 
 &lt;li&gt;Create a &lt;code&gt;passRole&lt;/code&gt; policy with the following JSON document and assign it to the IAM role that creates the ML connector. This lets the principal running the Python code pass the &lt;strong&gt;OpenSearchBedrockEmbeddingRole&lt;/strong&gt; to OpenSearch. Replace &lt;code&gt;&amp;lt;your-aws-account-id&amp;gt;&lt;/code&gt; with your own AWS account ID. 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="language-json"&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "iam:PassRole",
      "Resource": "arn:aws:iam::&lt;your-aws-account-id&gt;:role/OpenSearchBedrockEmbeddingRole"
    }
  ]
}&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;/li&gt; 
 &lt;li&gt;By using fine-grained access control (FGAC), map the IAM role as a backend role for the &lt;code&gt;ml_full_access&lt;/code&gt; role in the OpenSearch Dashboards Security plugin. This mapping lets the user create ML connectors: 
  &lt;ol type="1"&gt; 
   &lt;li&gt;Log in to OpenSearch Dashboards and open the &lt;strong&gt;Security&lt;/strong&gt; page from the navigation menu.&lt;/li&gt; 
   &lt;li&gt;Choose &lt;strong&gt;Roles&lt;/strong&gt; and select &lt;code&gt;ml_full_access&lt;/code&gt;.&lt;/li&gt; 
   &lt;li&gt;Choose &lt;strong&gt;Mapped Users&lt;/strong&gt; and &lt;strong&gt;Manage Mapping&lt;/strong&gt;.&lt;/li&gt; 
   &lt;li&gt;Under &lt;strong&gt;Backend roles&lt;/strong&gt;, add the ARN of the IAM role that you created in the previous steps.&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/03/BDB-5606-2.gif" alt="Animated demo of OpenSearch Dashboards Security plugin showing the ml_full_access role with the Mapped Users tab open and an IAM role being added as a backend role." width="600"&gt;&lt;/p&gt; 
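&lt;p&gt;If you prefer to script the role creation instead of using the console, the following sketch builds the same two policy documents and passes them to an IAM client. It assumes a Boto3 IAM client and uses its &lt;code&gt;create_role&lt;/code&gt; and &lt;code&gt;put_role_policy&lt;/code&gt; calls. The function names &lt;code&gt;trust_policy&lt;/code&gt;, &lt;code&gt;invoke_policy&lt;/code&gt;, and &lt;code&gt;create_embedding_role&lt;/code&gt; are hypothetical helpers introduced here, not part of the original walkthrough.&lt;/p&gt;

```python
import json


def trust_policy() -> dict:
    """Trust policy that lets OpenSearch Service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "opensearchservice.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }


def invoke_policy(region: str) -> dict:
    """Inline policy mirroring OpenSearchBedrockEmbeddingPolicy from this step."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["bedrock:InvokeAgent", "bedrock:InvokeModel"],
            "Resource": [f"arn:aws:bedrock:{region}::foundation-model/*"],
        }],
    }


def create_embedding_role(iam_client, region: str,
                          role_name: str = "OpenSearchBedrockEmbeddingRole") -> None:
    """Create the role, then attach the inline policy, using an IAM client."""
    iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(trust_policy()),
    )
    iam_client.put_role_policy(
        RoleName=role_name,
        PolicyName="OpenSearchBedrockEmbeddingPolicy",
        PolicyDocument=json.dumps(invoke_policy(region)),
    )


# Example (requires boto3 and IAM permissions; not run here):
#   import boto3
#   create_embedding_role(boto3.client("iam"), region="us-east-1")
```

&lt;p&gt;The backend role mapping in OpenSearch Dashboards is a separate step and is not covered by this sketch.&lt;/p&gt;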
&lt;h3 id="step-2-connect-to-the-model-by-using-opensearch-ml-connectors"&gt;Step 2: Connect to the model by using OpenSearch ML Connectors&lt;/h3&gt; 
&lt;p&gt;In this section, you create an ML connector to link OpenSearch Service with the Bedrock Nova Multimodal Embeddings model. You then register and deploy the model so you can use it for neural search queries.&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;Create a file named &lt;code&gt;create-connector.py&lt;/code&gt; with the following code. Replace &lt;code&gt;&amp;lt;your hostname&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;your region&amp;gt;&lt;/code&gt;, and &lt;code&gt;&amp;lt;your account id&amp;gt;&lt;/code&gt; placeholders within the code.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;import boto3
import requests
from requests_aws4auth import AWS4Auth

host = '&lt;your hostname&gt;'  # CHANGE THIS: domain endpoint URL, including https://
region = '&lt;your region&gt;'  # CHANGE THIS
account_id = '&lt;your account id&gt;'  # CHANGE THIS
service = 'es'

# Sign the request with the credentials of the principal running this script
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)

path = '/_plugins/_ml/connectors/_create'
url = host + path

payload = {
    "name": "Amazon Bedrock Nova multimodal model - text embedding",
    "description": "Test connector for Amazon Bedrock Nova multimodal model - text embedding",
    "version": 1,
    "protocol": "aws_sigv4",
    "credential": {
        # Role that OpenSearch Service assumes to invoke the model
        "roleArn": f"arn:aws:iam::{account_id}:role/OpenSearchBedrockEmbeddingRole"
    },
    "parameters": {
        "region": region,
        "service_name": "bedrock",
        "model": "amazon.nova-2-multimodal-embeddings-v1:0",
        "input_docs_processed_step_size": 1,
        "dimensions": 1024,
        "embeddingTypes": [
            "float"
        ],
        "truncationMode": "NONE"
    },
    "actions": [
        {
            "action_type": "predict",
            "method": "POST",
            "headers": {
                "content-type": "application/json",
                "x-amz-content-sha256": "required"
            },
            "url": "https://bedrock-runtime.${parameters.region}.amazonaws.com/model/${parameters.model}/invoke",
            # Request template; the ${parameters.*} placeholders are filled in by OpenSearch
            "request_body": "{\n \"taskType\": \"SINGLE_EMBEDDING\",\n \"singleEmbeddingParams\": {\n \"embeddingPurpose\": \"GENERIC_INDEX\",\n \"embeddingDimension\": ${parameters.dimensions},\n \"text\": {\n \"truncationMode\": \"${parameters.truncationMode}\",\n \"value\": \"${parameters.inputText}\"\n }\n }\n}",
            "pre_process_function": "connector.pre_process.bedrock.nova.text_embedding",
            "post_process_function": "connector.post_process.bedrock.nova.embedding"
        }
    ]
}

headers = {"Content-Type": "application/json"}
r = requests.post(url, auth=awsauth, json=payload, headers=headers, timeout=15)
print(r.status_code)
print(r.text)&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
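&lt;p&gt;The &lt;code&gt;request_body&lt;/code&gt; in the connector is a template: the &lt;code&gt;${parameters.*}&lt;/code&gt; placeholders are replaced with connector parameters and the input text before the request goes to Amazon Bedrock. The following sketch imitates that substitution with a regular expression so you can see the resulting JSON. It is only an illustration of the placeholder idea, not the ML Commons implementation, and &lt;code&gt;render&lt;/code&gt; is a hypothetical helper. It does no escaping, so it assumes that the input text contains no quotation marks.&lt;/p&gt;

```python
import json
import re

# Same structure as the connector's request_body, written on several lines
TEMPLATE = (
    '{"taskType": "SINGLE_EMBEDDING", "singleEmbeddingParams": {'
    '"embeddingPurpose": "GENERIC_INDEX", '
    '"embeddingDimension": ${parameters.dimensions}, '
    '"text": {"truncationMode": "${parameters.truncationMode}", '
    '"value": "${parameters.inputText}"}}}'
)


def render(template: str, parameters: dict) -> dict:
    """Fill ${parameters.name} placeholders and parse the result as JSON."""
    def fill(match):
        return str(parameters[match.group(1)])

    body = re.sub(r"\$\{parameters\.(\w+)\}", fill, template)
    return json.loads(body)


if __name__ == "__main__":
    request = render(TEMPLATE, {
        "dimensions": 1024,
        "truncationMode": "NONE",
        "inputText": "jacket",
    })
    print(json.dumps(request, indent=2))
```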
&lt;ol start="2" type="1"&gt; 
 &lt;li&gt;Run &lt;code&gt;python create-connector.py&lt;/code&gt; in your terminal by using the IAM role with &lt;code&gt;ml_full_access&lt;/code&gt; and &lt;code&gt;passRole&lt;/code&gt; permissions created in the previous step. This script creates a connector between OpenSearch Service and the Bedrock Nova Multimodal Embeddings model.&lt;/li&gt; 
 &lt;li&gt;The script prints a response that contains the &lt;code&gt;connector_id&lt;/code&gt;. Note this value. Then, navigate to OpenSearch Dashboards and open &lt;strong&gt;Dev Tools&lt;/strong&gt;. Create a model group in which to register the model in the OpenSearch Service domain.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-json"&gt;POST /_plugins/_ml/model_groups/_register
{
  "name": "agent-conversational-search-model-group",
  "description": "A model group for bedrock Nova embedding models used for conversational search"
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="4" type="1"&gt; 
 &lt;li&gt;Register a model by using the &lt;code&gt;connector_id&lt;/code&gt; from the script output and the &lt;code&gt;model_group_id&lt;/code&gt; returned when you created the model group.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-json"&gt;POST /_plugins/_ml/models/_register
{
  "name": "nova-2-multimodal-embedding-v1",
  "function_name": "remote",
  "model_group_id": "&amp;lt;model group id&amp;gt;",
  "description": "Nova 2 Multimodal Embeddings Model",
  "connector_id": "&amp;lt;connector id&amp;gt;",
  "interface": {}
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="5" type="1"&gt; 
 &lt;li&gt;Run the following API call to deploy the model. Use the registered model ID from the previous step.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-json"&gt;POST /_plugins/_ml/models/&amp;lt;registered-model-id&amp;gt;/_deploy&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
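&lt;p&gt;The three Dev Tools calls in this step depend on each other: the model registration needs the model group ID, and the deployment needs the model ID. The following sketch chains them in Python. It assumes a hypothetical &lt;code&gt;post(path, body)&lt;/code&gt; callable that sends a signed request (for example, a thin wrapper around &lt;code&gt;requests.post&lt;/code&gt; with the same &lt;code&gt;AWS4Auth&lt;/code&gt; object as &lt;code&gt;create-connector.py&lt;/code&gt;) and returns the parsed JSON response. The response field names &lt;code&gt;model_group_id&lt;/code&gt; and &lt;code&gt;model_id&lt;/code&gt; are assumptions, so verify them against the responses that you see in Dev Tools.&lt;/p&gt;

```python
from typing import Callable, Dict, Optional


def register_and_deploy(post: Callable[[str, Optional[dict]], dict],
                        connector_id: str) -> Dict[str, str]:
    """Run the three calls in order and return the resulting IDs."""
    group = post("/_plugins/_ml/model_groups/_register", {
        "name": "agent-conversational-search-model-group",
        "description": "A model group for Bedrock Nova embedding models",
    })
    group_id = group["model_group_id"]  # assumed response field

    model = post("/_plugins/_ml/models/_register", {
        "name": "nova-2-multimodal-embedding-v1",
        "function_name": "remote",
        "model_group_id": group_id,
        "description": "Nova 2 Multimodal Embeddings Model",
        "connector_id": connector_id,
        "interface": {},
    })
    model_id = model["model_id"]  # assumed response field

    # The deploy call has no request body
    post(f"/_plugins/_ml/models/{model_id}/_deploy", None)
    return {"model_group_id": group_id, "model_id": model_id}
```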
&lt;h3 id="step-3-create-an-ingest-pipeline-for-data-indexing"&gt;Step 3: Create an ingest pipeline for data indexing&lt;/h3&gt; 
&lt;p&gt;Use the following code to create an ingest pipeline for data indexing. For each document, the pipeline calls the embedding model to generate an embedding for the &lt;code&gt;title&lt;/code&gt; field and stores the result in the &lt;code&gt;title_vector&lt;/code&gt; field of the OpenSearch index.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-json"&gt;PUT /_ingest/pipeline/nova_multimodal_embedding
{
  "description": "Text embedding pipeline using nova_multimodal_embedding",
  "processors": [
    {
      "text_embedding": {
        "model_id": "&amp;lt;deployed model id&amp;gt;",
        "field_map": {
          "title": "title_vector"
        }
      }
    }
  ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
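&lt;p&gt;The effect of the &lt;code&gt;field_map&lt;/code&gt; setting can be pictured as follows. This is a conceptual sketch rather than OpenSearch code: &lt;code&gt;apply_text_embedding&lt;/code&gt; and &lt;code&gt;toy_embed&lt;/code&gt; are hypothetical names, and &lt;code&gt;toy_embed&lt;/code&gt; returns a two-number placeholder instead of a real 1024-dimensional embedding.&lt;/p&gt;

```python
from typing import Callable, Dict, List


def apply_text_embedding(doc: Dict[str, object],
                         field_map: Dict[str, str],
                         embed: Callable[[str], List[float]]) -> Dict[str, object]:
    """Copy the document and add one vector field per mapped text field."""
    out = dict(doc)
    for source_field, vector_field in field_map.items():
        if source_field in doc:
            out[vector_field] = embed(str(doc[source_field]))
    return out


def toy_embed(text: str) -> List[float]:
    """Hypothetical embedding: NOT a real model, just length and word count."""
    return [float(len(text)), float(len(text.split()))]


if __name__ == "__main__":
    doc = {"title": "Mens Cotton Jacket", "price": 55.99}
    indexed = apply_text_embedding(doc, {"title": "title_vector"}, toy_embed)
    print(indexed)
```

&lt;p&gt;The original fields are left untouched. Only the mapped vector field is added before the document is indexed.&lt;/p&gt;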
&lt;h3 id="step-4-create-an-index-for-storing-data"&gt;Step 4: Create an index for storing data&lt;/h3&gt; 
&lt;p&gt;Create an index named &lt;code&gt;product&lt;/code&gt; for storing data by using Dev Tools. This index stores raw text and 1024-dimensional embeddings of the &lt;code&gt;title&lt;/code&gt; field, and uses the ingest pipeline you created in the previous step.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-json"&gt;PUT /product
{
  "settings": {
    "index": {
      "default_pipeline": "nova_multimodal_embedding",
      "knn": true
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard"
      },
      "title_vector": {
        "type": "knn_vector",
        "dimension": 1024,
        "method": {
          "name": "hnsw",
          "space_type": "cosinesimil",
          "engine": "lucene"
        }
      },
      "price": {
        "type": "float"
      },
      "description": {
        "type": "text"
      },
      "category": {
        "type": "keyword"
      },
      "image_url": {
        "type": "text"
      },
      "rating": {
        "properties": {
          "rate": {
            "type": "float"
          },
          "count": {
            "type": "integer"
          }
        }
      }
    }
  }
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3 id="step-5-ingest-sample-data"&gt;Step 5: Ingest sample data&lt;/h3&gt; 
&lt;p&gt;Use the following code to ingest the sample product data in Dev Tools.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-json"&gt;POST /_bulk
{"index": {"_index": "product", "_id": "2"}}
{"id":2,"title":"Mens Casual Premium Slim Fit T-Shirts","price":22.3,"description":"Slim-fitting style, contrast raglan long sleeve, three-button henley placket, light weight &amp;amp; soft fabric for breathable and comfortable wearing.","category":"men's clothing","image":"https://fakestoreapi.com/img/71-3HjGNDUL._AC_SY879._SX._UX._SY._UY_.jpg","rating":{"rate":4.1,"count":259}}
{"index": {"_index": "product", "_id": "3"}}
{"id":3,"title":"Mens Cotton Jacket","price":55.99,"description":"great outerwear jackets for Spring/Autumn/Winter, suitable for many occasions, such as working, hiking, camping, mountain/rock climbing, cycling, traveling or other outdoors.","category":"men's clothing","image":"https://fakestoreapi.com/img/71li-ujtlUL._AC_UX679_.jpg","rating":{"rate":4.7,"count":500}}
{"index": {"_index": "product", "_id": "4"}}
{"id":4,"title":"Mens Casual Slim Fit","price":15.99,"description":"The color could be slightly different between on the screen and in practice.","category":"men's clothing","image":"https://fakestoreapi.com/img/71YXzeOuslL._AC_UY879_.jpg","rating":{"rate":2.1,"count":430}}
{"index": {"_index": "product", "_id": "5"}}
{"id":5,"title":"John Hardy Women's Legends Naga Gold &amp;amp; Silver Dragon Station Chain Bracelet","price":695,"description":"From our Legends Collection, the Naga was inspired by the mythical water dragon that protects the ocean's pearl.","category":"jewelery","image":"https://fakestoreapi.com/img/71pWzhdJNwL._AC_UL640_QL65_ML3_.jpg","rating":{"rate":4.6,"count":400}}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
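&lt;p&gt;For larger catalogs, you might generate the bulk request body from a list of products instead of writing it by hand. The following sketch builds the newline-delimited JSON that the &lt;code&gt;_bulk&lt;/code&gt; API expects, with one action line followed by one document line per product and a trailing newline. &lt;code&gt;build_bulk_body&lt;/code&gt; is a hypothetical helper, and sending the body to your domain is left out.&lt;/p&gt;

```python
import json
from typing import Dict, List


def build_bulk_body(index: str, products: List[Dict[str, object]]) -> str:
    """Return an NDJSON string: an action line, then the document, per product."""
    lines = []
    for product in products:
        action = {"index": {"_index": index, "_id": str(product["id"])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(product))
    # The bulk API requires the body to end with a newline
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    sample = [
        {"id": 3, "title": "Mens Cotton Jacket", "price": 55.99},
        {"id": 4, "title": "Mens Casual Slim Fit", "price": 15.99},
    ]
    print(build_bulk_body("product", sample), end="")
```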
&lt;h3 id="step-6-query-the-index"&gt;Step 6: Query the index&lt;/h3&gt; 
&lt;p&gt;Run the following API call to test semantic search by using the Nova Multimodal Embeddings model.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-json"&gt;GET /product/_search
{
  "_source": false,
  "fields": [
    "title",
    "price",
    "category",
    "image"
  ],
  "size": 3,
  "query": {
    "neural": {
      "title_vector": {
        "query_text": "jacket",
        "model_id": "&amp;lt;deployed model id&amp;gt;",
        "k": 5
      }
    }
  }
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The output of the preceding query should look like the following.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-json"&gt;{
  ...
  "hits": {
    "total": {
      "value": 4,
      "relation": "eq"
    },
    "max_score": 0.8333229,
    "hits": [
      {
        "_index": "product",
        "_id": "3",
        "_score": 0.8333229,
        "fields": {
          "image": [
            "https://fakestoreapi.com/img/71li-ujtlUL._AC_UX679_.jpg"
          ],
          "title": [
            "Mens Cotton Jacket"
          ],
          "category": [
            "men's clothing"
          ],
          "price": [
            55.99
          ]
        }
      }
      ...
    ]
  }
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
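&lt;p&gt;The &lt;code&gt;_score&lt;/code&gt; value is not the raw cosine similarity. Assuming the conversion used by the Lucene engine for the &lt;code&gt;cosinesimil&lt;/code&gt; space, which is score = (1 + cosine) / 2, you can convert between the two as shown in the following sketch. Treat the formula as an assumption to confirm against the k-NN documentation for your OpenSearch version. Both function names are hypothetical.&lt;/p&gt;

```python
def cosine_to_lucene_score(cosine: float) -> float:
    """Map a cosine similarity in [-1, 1] to a score in [0, 1]."""
    return (1.0 + cosine) / 2.0


def lucene_score_to_cosine(score: float) -> float:
    """Inverse of cosine_to_lucene_score."""
    return 2.0 * score - 1.0


if __name__ == "__main__":
    # Top hit from the sample output in this step
    print(round(lucene_score_to_cosine(0.8333229), 4))
```

&lt;p&gt;Under that assumption, the top hit's score of 0.8333229 corresponds to a cosine similarity of about 0.67 between the query and the product title.&lt;/p&gt;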
&lt;h3 id="step-7-create-an-agent-with-strands-and-bedrock-agentcore-runtime"&gt;Step 7: Create an agent with Strands and Bedrock AgentCore Runtime&lt;/h3&gt; 
&lt;p&gt;Now, create the Strands Agent that uses &lt;a href="https://aws.amazon.com/bedrock/anthropic/" target="_blank" rel="noopener"&gt;Anthropic Claude Haiku 4.5 on Amazon Bedrock&lt;/a&gt;, the model ID set in the sample code, to search products from the OpenSearch Service index. To do so:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Import the Runtime app with &lt;code&gt;from bedrock_agentcore.runtime import BedrockAgentCoreApp&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;Initialize the app in your code with &lt;code&gt;app = BedrockAgentCoreApp()&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;Create the OpenSearch Service connection and search query with the &lt;code&gt;@tool&lt;/code&gt; decorator.&lt;/li&gt; 
 &lt;li&gt;Decorate the invocation function with the &lt;code&gt;@app.entrypoint&lt;/code&gt; decorator.&lt;/li&gt; 
 &lt;li&gt;Let AgentCore Runtime control the running of the agent with &lt;code&gt;app.run()&lt;/code&gt;.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Now, complete the following steps:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;Make sure that you have installed the necessary dependencies from the &lt;strong&gt;Prerequisites&lt;/strong&gt; section of this post.&lt;/li&gt; 
 &lt;li&gt;Create and save a file named &lt;code&gt;search_agent.py&lt;/code&gt; with the following code. Set the &lt;code&gt;host&lt;/code&gt;, &lt;code&gt;region&lt;/code&gt;, and &lt;code&gt;model_id&lt;/code&gt; values marked &lt;code&gt;CHANGE THIS&lt;/code&gt; in the &lt;code&gt;search_products&lt;/code&gt; function. 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="language-python"&gt;import json

import boto3
from bedrock_agentcore.runtime import BedrockAgentCoreApp
from opensearchpy import OpenSearch, RequestsHttpConnection
from requests_aws4auth import AWS4Auth
from strands import Agent, tool
from strands.models import BedrockModel

app = BedrockAgentCoreApp()


@tool
def search_products(query: str, size: int = 5):
    """Search products in OpenSearch using neural search"""
    try:
        # OpenSearch configuration
        host = ''  ## CHANGE THIS, DOMAIN ENDPOINT WITHOUT HTTPS!
        region = ''  ## CHANGE THIS
        model_id = ''  ## CHANGE THIS with your deployed model id in OpenSearch
        service = 'es'
        credentials = boto3.Session().get_credentials()
        awsauth = AWS4Auth(credentials.access_key, credentials.secret_key, region, service, session_token=credentials.token)

        # Create OpenSearch client
        client = OpenSearch(
            hosts=[{'host': host, 'port': 443}],
            http_auth=awsauth,
            use_ssl=True,
            verify_certs=True,
            connection_class=RequestsHttpConnection
        )

        # Neural query against the title_vector field
        search_body = {
            "_source": False,
            "fields": ["title", "price", "category", "image"],
            "size": size,
            "query": {
                "neural": {
                    "title_vector": {
                        "query_text": query,
                        "model_id": model_id,
                        "k": 3
                    }
                }
            }
        }
        response = client.search(
            body=search_body,
            index="product"
        )

        # Each requested field comes back as a list; take the first value
        products = []
        for hit in response['hits']['hits']:
            fields = hit.get('fields', {})
            product = {
                'title': fields.get('title', [''])[0] if fields.get('title') else '',
                'price': fields.get('price', [''])[0] if fields.get('price') else '',
                'category': fields.get('category', [''])[0] if fields.get('category') else '',
                'image': fields.get('image', [''])[0] if fields.get('image') else ''
            }
            products.append(product)
        return f"Found {len(products)} products: {json.dumps(products, indent=2)}"
    except Exception as e:
        return f"Search error: {str(e)}"


# Amazon Bedrock model that the agent uses to plan and respond
model_id = "global.anthropic.claude-haiku-4-5-20251001-v1:0"
model = BedrockModel(
    model_id=model_id,
)
agent = Agent(
    model=model,
    tools=[search_products],
    system_prompt="You're a helpful assistant. You can do product search, and tell the product details."
)


@app.entrypoint
def strands_agent_bedrock(payload):
    """
    Invoke the agent with a payload
    """
    user_input = payload.get("prompt")
    print("User input:", user_input)
    response = agent(user_input)
    return response.message['content'][0]['text']


if __name__ == "__main__":
    pass  # Keeps the file valid while both lines below are commented out
    #strands_agent_bedrock({"prompt": "Search jacket"}) ##UNCOMMENT THIS FOR TESTING
    #app.run() ##UNCOMMENT THIS FOR DEPLOYMENT, MAKE SURE THE ABOVE LINE IS COMMENTED WHEN YOU ARE DEPLOYING TO AGENTCORE&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;p&gt;This code defines the agent so that you can test it locally before you deploy it.&lt;/p&gt;&lt;/li&gt; 
 &lt;li&gt;Navigate to the IAM console and add the &lt;strong&gt;AmazonBedrockLimitedAccess&lt;/strong&gt; permission policy to the principal running the code.&lt;/li&gt; 
 &lt;li&gt;Navigate to OpenSearch Dashboards and, from the left menu, choose &lt;strong&gt;Security&lt;/strong&gt; plugin, then choose &lt;strong&gt;Roles&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Create Role&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Name the role &lt;strong&gt;agentcore-permissions&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Under cluster permissions, add &lt;code&gt;cluster:admin/opensearch/ml/models/get&lt;/code&gt; and &lt;code&gt;cluster:admin/opensearch/ml/predict&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;Under index permissions, enter &lt;code&gt;product*&lt;/code&gt; as the index pattern. Add &lt;code&gt;search&lt;/code&gt; and &lt;code&gt;get&lt;/code&gt; permissions.&lt;/li&gt; 
 &lt;li&gt;Create the role.&lt;/li&gt; 
 &lt;li&gt;Choose the role you created, switch to the &lt;strong&gt;Mapped Users&lt;/strong&gt; tab, choose &lt;strong&gt;Manage mapping&lt;/strong&gt;, and add the role that you use for running the Python code as a backend role.&lt;/li&gt; 
 &lt;li&gt;Uncomment the line &lt;code&gt;strands_agent_bedrock({"prompt": "Search jacket"})&lt;/code&gt; and make sure that the &lt;code&gt;app.run()&lt;/code&gt; line is commented out in the code.&lt;/li&gt; 
 &lt;li&gt;Run &lt;code&gt;python search_agent.py&lt;/code&gt; in your terminal to start the shopping agent. The output should look similar to the following. 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="language-plaintext"&gt;"Here are the jacket search results:\n\n1. **Mens Cotton Jacket** - $55.99\n2. **Mens Casual Slim Fit** - $15.99\n3. **Mens Casual Premium Slim Fit T-Shirts** - $22.30\n4. **John Hardy Women's Legends Naga Gold &amp;amp; Silver Dragon Station Chain Bracelet** - $695.00\n\nThe most relevant jacket option is the **Mens Cotton Jacket** at $55.99. Would you like to know more about any of these products?&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;/li&gt; 
 &lt;li&gt;Comment out &lt;code&gt;strands_agent_bedrock({"prompt": "Search jacket"})&lt;/code&gt; and uncomment the &lt;code&gt;app.run()&lt;/code&gt; line in the code before you continue to the next step.&lt;/li&gt; 
&lt;/ol&gt; 
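&lt;p&gt;The hit-parsing logic inside &lt;code&gt;search_products&lt;/code&gt; can also be checked without AWS access by moving it into a pure function and feeding it a canned response. The following is a minimal sketch under that assumption: &lt;code&gt;extract_products&lt;/code&gt; is a hypothetical helper name, and the sample response only mirrors the shape of the Step 6 output.&lt;/p&gt;

```python
from typing import Dict, List

FIELDS = ("title", "price", "category", "image")


def extract_products(response: Dict[str, object]) -> List[Dict[str, object]]:
    """Flatten search hits: each requested field is a list, so take item 0."""
    products = []
    for hit in response.get("hits", {}).get("hits", []):
        fields = hit.get("fields", {})
        products.append({
            key: (fields.get(key) or [""])[0]
            for key in FIELDS
        })
    return products


# Canned response shaped like the Step 6 output (values copied from this post)
SAMPLE_RESPONSE = {
    "hits": {
        "hits": [
            {
                "_id": "3",
                "_score": 0.8333229,
                "fields": {
                    "title": ["Mens Cotton Jacket"],
                    "category": ["men's clothing"],
                    "price": [55.99],
                },
            }
        ]
    }
}

if __name__ == "__main__":
    print(extract_products(SAMPLE_RESPONSE))
```

&lt;p&gt;Fields that are missing from a hit, such as &lt;code&gt;image&lt;/code&gt; in the sample, come back as empty strings, which matches the behavior of the tool code.&lt;/p&gt;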
&lt;h3 id="step-8-configure-and-launch-your-agent-to-bedrock-agentcore-runtime"&gt;Step 8: Configure and launch your agent to Bedrock AgentCore Runtime&lt;/h3&gt; 
&lt;p&gt;The &lt;a href="https://docs.aws.amazon.com/bedrock/latest/userguide/agentcore-cli.html"&gt;AgentCore CLI&lt;/a&gt; is a command-line tool provided by AWS that simplifies deployment of agents to Amazon Bedrock AgentCore Runtime. When you run the CLI deployment command, it automates the entire deployment workflow: it creates the necessary IAM execution role with proper permissions, packages your Python application code along with its dependencies, uses &lt;a href="https://aws.amazon.com/codebuild/"&gt;AWS CodeBuild&lt;/a&gt; to build an optimized Docker container image, pushes that container image to &lt;a href="https://aws.amazon.com/ecr/"&gt;Amazon Elastic Container Registry (ECR)&lt;/a&gt;, and finally provisions the AgentCore Runtime environment that hosts your containerized agent. This eliminates the need for manual Dockerfile creation, container builds, or infrastructure management.&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;Before you start this step, make sure that you have completed Step 7 and installed the AgentCore CLI and the Python dependencies listed in the &lt;strong&gt;Prerequisites&lt;/strong&gt; section.&lt;/li&gt; 
 &lt;li&gt;Create a policy named &lt;strong&gt;AgentCoreAccessPolicy&lt;/strong&gt; with the following permissions and attach it to the role running the code. Replace &lt;code&gt;&amp;lt;ACCOUNT_ID&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;REGION&amp;gt;&lt;/code&gt; placeholders.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-json"&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "BedrockAgentCoreServiceAccess",
      "Effect": "Allow",
      "Action": [
        "bedrock-agentcore:*"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ECRAuthorizationToken",
      "Effect": "Allow",
      "Action": [
        "ecr:GetAuthorizationToken"
      ],
      "Resource": "*"
    },
    {
      "Sid": "ECRRepositoryAccess",
      "Effect": "Allow",
      "Action": [
        "ecr:BatchCheckLayerAvailability",
        "ecr:GetDownloadUrlForLayer",
        "ecr:BatchGetImage",
        "ecr:PutImage",
        "ecr:InitiateLayerUpload",
        "ecr:UploadLayerPart",
        "ecr:CompleteLayerUpload",
        "ecr:DescribeRepositories",
        "ecr:CreateRepository",
        "ecr:ListImages",
        "ecr:DescribeImages"
      ],
      "Resource": [
        "arn:aws:ecr:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT_ID&amp;gt;:repository/bedrock-agentcore-*"
      ]
    },
    {
      "Sid": "CodeBuildProjectAccess",
      "Effect": "Allow",
      "Action": [
        "codebuild:CreateProject",
        "codebuild:UpdateProject",
        "codebuild:StartBuild",
        "codebuild:BatchGetBuilds",
        "codebuild:DeleteProject"
      ],
      "Resource": "*"
    },
    {
      "Sid": "IAMRoleManagement",
      "Effect": "Allow",
      "Action": [
        "iam:CreateRole",
        "iam:AttachRolePolicy",
        "iam:PutRolePolicy",
        "iam:GetRole",
        "iam:GetRolePolicy",
        "iam:PassRole",
        "iam:DeleteRole",
        "iam:DeleteRolePolicy",
        "iam:DetachRolePolicy",
        "iam:CreateServiceLinkedRole"
      ],
      "Resource": [
        "arn:aws:iam::&amp;lt;ACCOUNT_ID&amp;gt;:role/BedrockAgentCoreExecutionRole-*",
        "arn:aws:iam::&amp;lt;ACCOUNT_ID&amp;gt;:role/AmazonBedrockAgentCoreSDKCodeBuild-*",
        "arn:aws:iam::&amp;lt;ACCOUNT_ID&amp;gt;:role/aws-service-role/runtime-identity.bedrock-agentcore.amazonaws.com/AWSServiceRoleForBedrockAgentCoreRuntimeIdentity"
      ]
    },
    {
      "Sid": "CloudWatchLogsAccess",
      "Effect": "Allow",
      "Action": [
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents",
        "logs:PutResourcePolicy"
      ],
      "Resource": "arn:aws:logs:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT_ID&amp;gt;:log-group:/aws/bedrock-agentcore/*"
    },
    {
      "Sid": "CloudWatchLogsResourcePolicy",
      "Effect": "Allow",
      "Action": [
        "logs:PutResourcePolicy"
      ],
      "Resource": "*"
    },
    {
      "Sid": "S3BucketManagement",
      "Effect": "Allow",
      "Action": [
        "s3:CreateBucket",
        "s3:PutBucketPolicy",
        "s3:PutBucketVersioning",
        "s3:PutBucketPublicAccessBlock",
        "s3:PutLifecycleConfiguration",
        "s3:PutObject",
        "s3:GetObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::bedrock-agentcore-codebuild-sources-&amp;lt;ACCOUNT_ID&amp;gt;-*",
        "arn:aws:s3:::bedrock-agentcore-codebuild-sources-&amp;lt;ACCOUNT_ID&amp;gt;-*/*"
      ]
    }
  ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="3" type="1"&gt; 
 &lt;li&gt;Create a file named &lt;code&gt;agentcore.yaml&lt;/code&gt; in your project directory with the following configuration. Replace &lt;code&gt;&amp;lt;REGION&amp;gt;&lt;/code&gt;, &lt;code&gt;&amp;lt;ACCOUNT_ID&amp;gt;&lt;/code&gt;, and &lt;code&gt;&amp;lt;OPENSEARCH_DOMAIN_NAME&amp;gt;&lt;/code&gt; placeholders:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;# AgentCore Runtime Configuration
  runtime:
    name: shopping-search-agent-runtime
    entrypoint: search_agent.py:strands_agent_bedrock
    region: &amp;lt;REGION&amp;gt;  # CHANGE THIS - Your AWS region (e.g., us-east-1)

    execution_role:
      create: true
      name: BedrockAgentCoreExecutionRole-shopping-agent
      policies:
        - policy_name: BedrockAndOpenSearchAccess
          policy_document:
            Version: "2012-10-17"
            Statement:
              - Sid: BedrockModelAccess
                Effect: Allow
                Action:
                  - bedrock:InvokeModel
                  - bedrock:InvokeModelWithResponseStream
                Resource:
                  - arn:aws:bedrock:*::foundation-model/*
                  - arn:aws:bedrock:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT_ID&amp;gt;:inference-profile/*  # CHANGE THIS - Replace &amp;lt;REGION&amp;gt; and &amp;lt;ACCOUNT_ID&amp;gt;
              - Sid: OpenSearchAccess
                Effect: Allow
                Action:
                  - es:ESHttpGet
                  - es:ESHttpPost
                  - es:ESHttpPut
                Resource: arn:aws:es:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT_ID&amp;gt;:domain/&amp;lt;OPENSEARCH_DOMAIN_NAME&amp;gt;/*  # CHANGE THIS - Replace all three placeholders
              - Sid: CloudWatchLogsAccess
                Effect: Allow
                Action:
                  - logs:CreateLogGroup
                  - logs:CreateLogStream
                  - logs:PutLogEvents
                Resource: arn:aws:logs:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT_ID&amp;gt;:log-group:/aws/bedrock-agentcore/runtimes/*  # CHANGE THIS - Replace &amp;lt;REGION&amp;gt; and &amp;lt;ACCOUNT_ID&amp;gt;
              - Sid: ECRImageAccess
                Effect: Allow
                Action:
                  - ecr:GetAuthorizationToken
                  - ecr:BatchGetImage
                  - ecr:GetDownloadUrlForLayer
                Resource: "*"

    container:
      architecture: arm64  # Options: arm64 or x86_64
      requirements_file: requirements.txt

    ecr:
      auto_create: true&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="4" type="1"&gt; 
 &lt;li&gt;Run the following commands in your terminal to deploy the agent to AgentCore Runtime: 
  &lt;ul type="1"&gt; 
   &lt;li&gt;Create the AgentCore project, then replace the generated &lt;code&gt;main.py&lt;/code&gt; with your search agent: 
    &lt;div class="hide-language"&gt; 
     &lt;pre&gt;&lt;code class="lang-python"&gt;agentcore create --name ShoppingAgent --defaults
cd ShoppingAgent
cp ../search_agent.py app/ShoppingAgent/main.py&lt;/code&gt;&lt;/pre&gt; 
    &lt;/div&gt; &lt;/li&gt; 
   &lt;li&gt;Add OpenSearch and other dependencies: 
    &lt;div class="hide-language"&gt; 
     &lt;pre&gt;&lt;code class="lang-python"&gt;cd app/ShoppingAgent
uv add opensearch-py requests-aws4auth boto3
cd ../..&lt;/code&gt;&lt;/pre&gt; 
    &lt;/div&gt; &lt;/li&gt; 
   &lt;li&gt;Deploy the agent to AgentCore Runtime. This process takes approximately 5-10 minutes. 
    &lt;div class="hide-language"&gt; 
     &lt;pre&gt;&lt;code class="lang-python"&gt;agentcore deploy&lt;/code&gt;&lt;/pre&gt; 
    &lt;/div&gt; &lt;/li&gt; 
   &lt;li&gt;Once deployment completes, verify the runtime status: 
    &lt;div class="hide-language"&gt; 
     &lt;pre&gt;&lt;code class="lang-python"&gt;agentcore status&lt;/code&gt;&lt;/pre&gt; 
    &lt;/div&gt; &lt;p&gt;You should see:&lt;/p&gt; 
    &lt;div class="hide-language"&gt; 
     &lt;pre&gt;&lt;code class="lang-code"&gt;ShoppingAgent: Deployed - Runtime: READY (arn:aws:bedrock-agentcore:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT_ID&amp;gt;:runtime/ShoppingAgent_...)

URL: https://bedrock-agentcore.&amp;lt;REGION&amp;gt;.amazonaws.com/runtimes/.../invocations&lt;/code&gt;&lt;/pre&gt; 
    &lt;/div&gt; &lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;h3 id="step-9-configure-opensearch-service-access"&gt;Step 9: Configure OpenSearch Service access&lt;/h3&gt; 
&lt;p&gt;Map your AgentCore execution role to an OpenSearch backend role so the agent can access your data.&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;Navigate to OpenSearch Dashboards. From the left menu, choose the &lt;strong&gt;Security&lt;/strong&gt; plugin, then choose &lt;strong&gt;Roles&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Search for &lt;strong&gt;agentcore-permissions&lt;/strong&gt; and choose the role. Then, navigate to the &lt;strong&gt;Mapped Users&lt;/strong&gt; tab, choose &lt;strong&gt;Manage mapping&lt;/strong&gt;, and add &lt;code&gt;arn:aws:iam::&amp;lt;ACCOUNT_ID&amp;gt;:role/AmazonBedrockAgentCoreSDKRuntime-us-east-1-custom&lt;/code&gt; as a backend role. Replace the &lt;code&gt;&amp;lt;ACCOUNT_ID&amp;gt;&lt;/code&gt; placeholder with your account ID.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/03/BDB-5606-3.gif" alt="Animated demo of OpenSearch Dashboards Security plugin showing the agent-permissions role with the Mapped Users tab open and the AgentCore SDK runtime IAM role added as a backend role." width="600"&gt;&lt;/p&gt; 
&lt;h3 id="step-10-invoke-the-bedrock-agentcore-runtime"&gt;Step 10: Invoke the Bedrock AgentCore Runtime&lt;/h3&gt; 
&lt;p&gt;You can test the agent in &lt;a href="https://us-east-1.console.aws.amazon.com/bedrock-agentcore/playground?region=us-east-1" target="_blank" rel="noopener"&gt;Agent Sandbox&lt;/a&gt;. Enter the prompt &lt;code&gt;Search jacket less than 50$&lt;/code&gt;, and the agent returns the relevant result from the OpenSearch Service index with a summary.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-91995" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/09/bdb-5606-last-image.png" alt="Agent Sandbox console showing a shopping agent response that returns the Mens Cotton Jacket as the relevant result for the prompt “Search jacket less than 50$”." width="1384" height="660"&gt;&lt;/p&gt; 
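&lt;p&gt;You can also invoke a deployed runtime from code instead of the console. The following is a minimal sketch, assuming Boto3 with the &lt;code&gt;bedrock-agentcore&lt;/code&gt; client and its &lt;code&gt;invoke_agent_runtime&lt;/code&gt; operation. The helper names &lt;code&gt;build_invoke_request&lt;/code&gt; and &lt;code&gt;invoke_agent&lt;/code&gt; are hypothetical, and the &lt;code&gt;prompt&lt;/code&gt; payload key is an assumption about how the agent entry point reads its input, so adjust it to match your agent.&lt;/p&gt;

```python
import json
import uuid


def build_invoke_request(agent_runtime_arn, prompt):
    """Build the keyword arguments for an invoke_agent_runtime call."""
    return {
        "agentRuntimeArn": agent_runtime_arn,
        # Session IDs must be at least 33 characters; a UUID4 string has 36.
        "runtimeSessionId": str(uuid.uuid4()),
        # Assumption: the agent entry point reads a "prompt" key from the payload.
        "payload": json.dumps({"prompt": prompt}).encode("utf-8"),
    }


def invoke_agent(agent_runtime_arn, prompt, region="us-east-1"):
    """Send one prompt to the deployed runtime and return the raw response text."""
    import boto3  # imported here so the request builder works without Boto3 installed

    client = boto3.client("bedrock-agentcore", region_name=region)
    response = client.invoke_agent_runtime(
        **build_invoke_request(agent_runtime_arn, prompt)
    )
    # The response body is a stream; how you parse it depends on what your agent returns.
    return response["response"].read().decode("utf-8")
```

&lt;p&gt;To try it, call &lt;code&gt;invoke_agent&lt;/code&gt; with the runtime ARN printed by &lt;code&gt;agentcore status&lt;/code&gt; and the same prompt that you used in the console. The calling identity needs permission to invoke the runtime.&lt;/p&gt;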
&lt;p&gt;In real-world scenarios, you can design a search application with a Strands Agent deployed in AgentCore Runtime. You can add &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/memory.html" target="_blank" rel="noopener"&gt;AgentCore Memory&lt;/a&gt;, which gives your AI agents the ability to remember past interactions and provide more context-aware, personalized conversations.&lt;/p&gt; 
&lt;h2 id="cleanup"&gt;Cleanup&lt;/h2&gt; 
&lt;p&gt;To avoid incurring future charges, delete the resources created while building this solution:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/gsgdeleting.html" target="_blank" rel="noopener"&gt;Delete the OpenSearch Service domain&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/bedrock-agentcore-control/latest/APIReference/API_DeleteAgentRuntime.html" target="_blank" rel="noopener"&gt;Delete the Amazon Bedrock AgentCore Runtime&lt;/a&gt; resources.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, you saw how to create a conversational search with Amazon OpenSearch Service and Strands Agents. You also learned how to deploy the agent on Amazon Bedrock AgentCore Runtime. You can further enhance this shopping agent by using other AgentCore capabilities. For example, &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/memory.html" target="_blank" rel="noopener"&gt;AgentCore Memory&lt;/a&gt; retains user preferences and past interactions across sessions, &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/identity.html" target="_blank" rel="noopener"&gt;AgentCore Identity&lt;/a&gt; manages shopper authentication and access control, and &lt;a href="https://docs.aws.amazon.com/bedrock-agentcore/latest/devguide/observability.html" target="_blank" rel="noopener"&gt;AgentCore Observability&lt;/a&gt; helps you monitor and debug agent behavior in production. Together, these services help you build shopping experiences that deliver instant, relevant assistance at scale.&lt;/p&gt; 
&lt;p&gt;Now it’s your turn. Build your own conversational search experience by integrating OpenSearch Service and Strands Agents with your product catalog. To learn more, see the &lt;a href="https://aws.amazon.com/opensearch-service/" target="_blank" rel="noopener"&gt;Amazon OpenSearch Service&lt;/a&gt; and &lt;a href="https://aws.amazon.com/bedrock/agentcore/" target="_blank" rel="noopener"&gt;Amazon Bedrock AgentCore&lt;/a&gt; detail pages.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/03/BDB-5606-5.jpg" alt="Omama Khurshid" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Omama Khurshid&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/omama-khurshid-66b1085/" target="_blank" rel="noopener"&gt;Omama&lt;/a&gt; is a GTM Specialist Solutions Architect for Analytics at Amazon Web Services. She focuses on helping customers across various industries build reliable, scalable, and efficient solutions. Outside of work, she enjoys spending time with her family, listening to music, and learning new technologies.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/03/BDB-5606-6.jpeg" alt="Jumana Nagaria" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Jumana Nagaria&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/jumana-nagaria/" target="_blank" rel="noopener"&gt;Jumana&lt;/a&gt; is a Prototyping Architect at AWS. She builds innovative prototypes with customers to solve their business challenges. She is passionate about cloud computing and data analytics. Outside of work, Jumana enjoys travelling, reading, painting, and spending quality time with friends and family.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/03/BDB-5606-7.jpg" alt="Canberk Keles" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Canberk Keles&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/canberkkeles/" target="_blank" rel="noopener"&gt;Canberk&lt;/a&gt; is a Solutions Architect at Amazon Web Services, helping software companies achieve their business goals by leveraging AWS technologies. He is part of the OpenSearch specialist community within AWS and has been helping customers harness the power of OpenSearch. Outside of work, he enjoys sports, reading, traveling, and playing video games.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>Choosing the right workflow orchestration service for your use case: Amazon MWAA and AWS Step Functions</title>
		<link>https://aws.amazon.com/blogs/big-data/choosing-the-right-workflow-orchestration-service-for-your-use-case-amazon-mwaa-and-aws-step-functions/</link>
		
		<dc:creator><![CDATA[Rajkumar Raghuwanshi]]></dc:creator>
		<pubDate>Wed, 10 Jun 2026 15:32:13 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon Managed Workflows for Apache Airflow (Amazon MWAA)]]></category>
		<category><![CDATA[AWS Step Functions]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">718b592ab4ad51e15456978c88e0c10604592609</guid>

					<description>This post explores how to select the right workflow orchestration service based on your specific use case requirements. We'll examine key workflow characteristics, present real-world scenarios, and provide practical guidance to help you make an informed decision for your particular needs.</description>
										<content:encoded>&lt;p&gt;Whether you’re processing financial data, managing e-commerce orders, or training machine learning (ML) models, efficiently coordinating complex processes is essential. Amazon Web Services (AWS) offers two services for workflow orchestration: &lt;a href="https://aws.amazon.com/managed-workflows-for-apache-airflow/" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Workflows for Apache Airflow&lt;/a&gt; (Amazon MWAA) and &lt;a href="https://aws.amazon.com/step-functions/" target="_blank" rel="noopener noreferrer"&gt;AWS Step Functions&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;This post explores how to select the right workflow orchestration service based on your specific use case requirements. We’ll examine key workflow characteristics, present real-world scenarios, and provide practical guidance to help you make an informed decision for your particular needs.&lt;/p&gt; 
&lt;h2&gt;Understanding workflow orchestration requirements&lt;/h2&gt; 
&lt;p&gt;Before exploring specific services, consider the key dimensions that influence workflow orchestration needs:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Data statefulness&lt;/strong&gt;: Does your workflow process independent units of work (stateless) or create dependencies where each step modifies data from previous steps (stateful)?&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Execution duration&lt;/strong&gt;: Are your workflows short-lived (seconds to minutes) or long-running (hours to days)?&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Scheduling requirements&lt;/strong&gt;: Do you need built-in time-based execution or rely primarily on event triggers?&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Recovery capabilities&lt;/strong&gt;: How critical is the ability to restart from specific failure points rather than reprocessing entirely?&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Integration complexity&lt;/strong&gt;: What systems, services, and data sources need to be coordinated?&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Security and access control&lt;/strong&gt;: Do you need fine-grained permissions for different workflow components?&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Let’s explore how these requirements map to real-world use cases and the appropriate orchestration solutions.&lt;/p&gt; 
&lt;h2&gt;Use case: Enterprise data analytics pipeline&lt;/h2&gt; 
&lt;p&gt;This scenario illustrates how Amazon MWAA handles complex, stateful data pipelines with built-in scheduling and granular recovery.&lt;/p&gt; 
&lt;h3&gt;Business challenge&lt;/h3&gt; 
&lt;p&gt;A global financial services company processes massive volumes of transaction data daily, requiring sophisticated data analytics capabilities. Their requirements include:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Processing 5-10 TB of financial transaction data daily&lt;/li&gt; 
 &lt;li&gt;Running complex extract, transform, and load (ETL) jobs with multiple transformation stages&lt;/li&gt; 
 &lt;li&gt;Generating regulatory reports for compliance use cases&lt;/li&gt; 
 &lt;li&gt;Supporting both scheduled batch processing and event-driven workflows&lt;/li&gt; 
 &lt;li&gt;Handling long-running jobs that can take up to 12 hours&lt;/li&gt; 
 &lt;li&gt;Ensuring data consistency and integrity throughout the pipeline&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Workflow characteristics&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Data statefulness&lt;/strong&gt;: Highly stateful workflows where each processing step modifies transaction data, creating dependencies throughout the pipeline&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Execution duration&lt;/strong&gt;: Long-running processes lasting 2-12 hours&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Scheduling needs&lt;/strong&gt;: Mixed time-based and event-driven patterns&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Recovery requirements&lt;/strong&gt;: Critical ability to resume from specific failure points&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Integration complexity&lt;/strong&gt;: Orchestrates multiple AWS services and external systems&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Solution: Amazon Managed Workflows for Apache Airflow (Amazon MWAA)&lt;/h2&gt; 
&lt;p&gt;For this enterprise data analytics scenario, Amazon MWAA provides capabilities that align well with these requirements:&lt;/p&gt; 
&lt;h3&gt;Stateful workflow management&lt;/h3&gt; 
&lt;p&gt;MWAA excels at managing complex, stateful data pipelines where data consistency is critical. When processing terabytes of financial data, MWAA’s ability to resume from the last successful checkpoint helps prevent costly reprocessing and maintain data integrity.&lt;/p&gt; 
&lt;p&gt;The following code example demonstrates how to structure a complex financial ETL pipeline in MWAA:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;# Example: Complex ETL pipeline with proper dependency management
# (Airflow 2.x syntax; the callable below is a placeholder for your own logic)
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def run_step(step_name):
    """Placeholder: replace with your extract, transform, load, or report logic."""
    print(f"Running {step_name}")


def make_task(task_id):
    """Create a PythonOperator that calls the placeholder with its own task ID."""
    return PythonOperator(task_id=task_id, python_callable=run_step, op_args=[task_id])


with DAG(
    'financial_etl_pipeline',
    schedule_interval='0 2 * * *',  # Daily at 2 AM
    start_date=datetime(2024, 1, 1),
    catchup=False,  # Don't backfill runs between start_date and today
) as dag:
    # Define tasks inside the DAG context so they attach to this DAG
    extract_transactions = make_task('extract_transactions')
    extract_market_data = make_task('extract_market_data')
    transform_data = make_task('transform_data')
    load_warehouse = make_task('load_warehouse')
    generate_reports = make_task('generate_reports')

    # Express complex dependencies clearly
    [extract_transactions, extract_market_data] &amp;gt;&amp;gt; transform_data &amp;gt;&amp;gt; [load_warehouse, generate_reports]&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;This &lt;a href="https://docs.aws.amazon.com/mwaa/latest/userguide/working-dags.html" target="_blank" rel="noopener noreferrer"&gt;Directed Acyclic Graph&lt;/a&gt; (DAG) defines task dependencies so that the two extraction tasks run in parallel, transformation follows, and loading and report generation then run in parallel. The &lt;code&gt;&amp;gt;&amp;gt;&lt;/code&gt; operator declares each dependency: transformation only begins after both extraction tasks complete successfully.&lt;/p&gt; 
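&lt;p&gt;To make the resulting execution order concrete, the following sketch groups the same five task names into waves using Python's standard &lt;code&gt;graphlib&lt;/code&gt; module (Python 3.9 and later). It only illustrates the ordering that the dependency chain implies and is not how the Airflow scheduler is implemented. The &lt;code&gt;upstream&lt;/code&gt; mapping and the &lt;code&gt;execution_waves&lt;/code&gt; helper are hypothetical names introduced for this example.&lt;/p&gt;

```python
from graphlib import TopologicalSorter

# Map each task to the tasks it waits for, mirroring the DAG's dependency chain
upstream = {
    "transform_data": {"extract_transactions", "extract_market_data"},
    "load_warehouse": {"transform_data"},
    "generate_reports": {"transform_data"},
}


def execution_waves(dependencies):
    """Group tasks into waves; every task in a wave has all of its upstream tasks done."""
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks that can run in parallel now
        waves.append(ready)
        sorter.done(*ready)  # mark the wave complete to unlock downstream tasks
    return waves


for number, wave in enumerate(execution_waves(upstream), start=1):
    print(f"Wave {number}: {', '.join(wave)}")
```

&lt;p&gt;The sketch prints three waves: both extraction tasks, then &lt;code&gt;transform_data&lt;/code&gt;, then &lt;code&gt;generate_reports&lt;/code&gt; and &lt;code&gt;load_warehouse&lt;/code&gt; together.&lt;/p&gt;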
&lt;h4&gt;Built-in scheduling capabilities&lt;/h4&gt; 
&lt;p&gt;MWAA includes native scheduling capabilities, making it straightforward to set up recurring workflows without additional services. The &lt;code&gt;schedule_interval&lt;/code&gt; parameter in the DAG definition provides flexible scheduling options using cron syntax.&lt;/p&gt; 
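&lt;p&gt;A five-field cron expression reads, from left to right, as minute, hour, day of month, month, and day of week. The following sketch labels each field for a few schedules. The &lt;code&gt;describe_cron&lt;/code&gt; helper and the example schedules other than &lt;code&gt;0 2 * * *&lt;/code&gt; are hypothetical and exist only to show how to read an expression; the helper does not validate field values.&lt;/p&gt;

```python
CRON_FIELDS = ("minute", "hour", "day_of_month", "month", "day_of_week")


def describe_cron(expression):
    """Split a five-field cron expression into a field-name-to-value mapping."""
    parts = expression.split()
    if len(parts) != len(CRON_FIELDS):
        raise ValueError(
            f"Expected {len(CRON_FIELDS)} fields, got {len(parts)}: {expression!r}"
        )
    return dict(zip(CRON_FIELDS, parts))


# Example schedules for illustration
examples = {
    "0 2 * * *": "daily at 2:00 AM",
    "0 */6 * * *": "every six hours, on the hour",
    "30 1 * * 1-5": "Monday through Friday at 1:30 AM",
}

for expression, meaning in examples.items():
    print(expression, "=", meaning, describe_cron(expression))
```

&lt;p&gt;Any of these strings can be passed as the &lt;code&gt;schedule_interval&lt;/code&gt; value in the DAG definition.&lt;/p&gt;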
&lt;h4&gt;Granular recovery and resume control&lt;/h4&gt; 
&lt;p&gt;During production incidents, operations teams can use the MWAA web interface to restart or bypass specific steps with a few clicks. This capability is important for stateful applications where restarting the entire workflow could compromise data consistency.&lt;/p&gt; 
&lt;p&gt;The MWAA web interface provides a visual representation of the workflow execution, allowing operators to:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Identify failed tasks&lt;/li&gt; 
 &lt;li&gt;Examine task logs for troubleshooting&lt;/li&gt; 
 &lt;li&gt;Clear the status of specific tasks&lt;/li&gt; 
 &lt;li&gt;Restart execution from specific points&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/23/BDB-4711-1.png"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 1: A Directed Acyclic Graph (DAG) in MWAA showing parallel execution of &lt;/em&gt;&lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/data-api.html" target="_blank" rel="noopener noreferrer"&gt;&lt;em&gt;Amazon Redshift Data API&lt;/em&gt;&lt;/a&gt;&lt;em&gt; tasks. If any task fails, you can re-run specific tasks rather than restarting from the beginning.&lt;/em&gt;&lt;/p&gt; 
&lt;h4&gt;Comprehensive monitoring and operational control&lt;/h4&gt; 
&lt;p&gt;MWAA’s metadata database maintains a comprehensive execution history, enabling organizations to build operational dashboards for:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Real-time workflow monitoring&lt;/li&gt; 
 &lt;li&gt;Task completion rate tracking&lt;/li&gt; 
 &lt;li&gt;Pipeline execution pattern analysis&lt;/li&gt; 
 &lt;li&gt;Optimization opportunity identification&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Implementation considerations&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Infrastructure planning&lt;/strong&gt;: MWAA requires capacity planning, but after you set minimum and maximum worker counts, automatic scaling handles variable workloads.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Security model&lt;/strong&gt;: MWAA uses a shared execution role across DAGs, but you can implement additional security through resource-level policies and separate environments for different teams.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Cost predictability&lt;/strong&gt;: The worker-hour pricing model provides predictable costs for long-running jobs, making budget planning more straightforward.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Use case: Real-time serverless application orchestration&lt;/h2&gt; 
&lt;p&gt;This scenario shows how AWS Step Functions handles event-driven, serverless workflows that need to scale automatically with unpredictable traffic.&lt;/p&gt; 
&lt;h3&gt;Business challenge&lt;/h3&gt; 
&lt;p&gt;An e-commerce platform needs to orchestrate real-time order processing workflows that can handle thousands of concurrent orders during peak shopping periods. Their requirements include:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Processing customer orders in real time (targeting sub-second response times)&lt;/li&gt; 
 &lt;li&gt;Coordinating payment validation, inventory checks, and fulfillment&lt;/li&gt; 
 &lt;li&gt;Integrating with multiple AWS services (&lt;a href="https://aws.amazon.com/lambda/" target="_blank" rel="noopener noreferrer"&gt;AWS Lambda&lt;/a&gt;, &lt;a href="https://aws.amazon.com/sqs/" target="_blank" rel="noopener noreferrer"&gt;Amazon Simple Queue Service&lt;/a&gt; (Amazon SQS), &lt;a href="https://aws.amazon.com/sns/" target="_blank" rel="noopener noreferrer"&gt;Amazon Simple Notification Service&lt;/a&gt; (Amazon SNS), &lt;a href="https://aws.amazon.com/dynamodb/" target="_blank" rel="noopener noreferrer"&gt;Amazon DynamoDB&lt;/a&gt;)&lt;/li&gt; 
 &lt;li&gt;Handling traffic spikes during promotional events&lt;/li&gt; 
 &lt;li&gt;Implementing approval workflows for high-value orders&lt;/li&gt; 
 &lt;li&gt;Maintaining cost efficiency during variable load periods&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Workflow characteristics&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Data statefulness&lt;/strong&gt;: Primarily stateless processing where each customer order represents an independent transaction&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Execution duration&lt;/strong&gt;: Rapid, real-time processing with response times ranging from under a second to a few minutes&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Event-driven nature&lt;/strong&gt;: Core architectural pattern where workflows are triggered by specific customer actions&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Integration requirements&lt;/strong&gt;: Extensive coordination with AWS serverless services&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Scalability needs&lt;/strong&gt;: Highly unpredictable traffic patterns requiring automatic scaling&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Solution: AWS Step Functions&lt;/h2&gt; 
&lt;p&gt;For this real-time e-commerce scenario, AWS Step Functions provides capabilities that align well with these requirements:&lt;/p&gt; 
&lt;h3&gt;Serverless architecture and automatic scaling&lt;/h3&gt; 
&lt;p&gt;Step Functions automatically scales to handle traffic spikes without infrastructure management. During peak shopping events like Black Friday, the service handles increased load without manual intervention.&lt;/p&gt; 
&lt;h3&gt;Event-driven workflow execution&lt;/h3&gt; 
&lt;p&gt;Step Functions is designed for order-triggered workflows that need immediate execution. The following JSON definition shows how to structure an e-commerce order processing workflow:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;{
  "Comment": "E-commerce Order Processing Workflow",
  "StartAt": "ValidatePayment",
  "States": {
    "ValidatePayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:region:account:function:ValidatePayment",
      "Retry": [
        {
          "ErrorEquals": ["States.TaskFailed"],
          "IntervalSeconds": 2,
          "MaxAttempts": 3,
          "BackoffRate": 2.0
        }
      ],
      "Next": "CheckInventory"
    },
    "CheckInventory": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "CheckWarehouse1",
          "States": {
            "CheckWarehouse1": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:region:account:function:CheckWarehouse",
              "End": true
            }
          }
        },
        {
          "StartAt": "CheckWarehouse2", 
          "States": {
            "CheckWarehouse2": {
              "Type": "Task",
              "Resource": "arn:aws:lambda:region:account:function:CheckWarehouse",
              "End": true
            }
          }
        }
      ],
      "Next": "ProcessOrder"
    },
    "ProcessOrder": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:region:account:function:ProcessOrder",
      "End": true
    }
  }
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;This Step Functions definition demonstrates several key capabilities:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;The &lt;code&gt;ValidatePayment&lt;/code&gt; state includes built-in retry logic with exponential backoff&lt;/li&gt; 
 &lt;li&gt;The &lt;code&gt;CheckInventory&lt;/code&gt; state uses parallel execution to simultaneously check multiple warehouses&lt;/li&gt; 
 &lt;li&gt;Each Lambda function is called via its Amazon Resource Name (ARN), providing direct integration with AWS services&lt;/li&gt; 
&lt;/ul&gt; 
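&lt;p&gt;To see what the retry settings mean in practice, the following sketch works out the wait before each retry from the values in the definition: &lt;code&gt;IntervalSeconds&lt;/code&gt; is the wait before the first retry, &lt;code&gt;BackoffRate&lt;/code&gt; multiplies the wait after each attempt, and &lt;code&gt;MaxAttempts&lt;/code&gt; caps the number of retries. The &lt;code&gt;retry_delays&lt;/code&gt; helper is a hypothetical name for this illustration, and it ignores optional settings such as a maximum delay or jitter.&lt;/p&gt;

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Return the wait in seconds before each retry attempt."""
    return [interval_seconds * backoff_rate ** attempt for attempt in range(max_attempts)]


# Values from the ValidatePayment state
delays = retry_delays(interval_seconds=2, max_attempts=3, backoff_rate=2.0)
print(delays)       # [2.0, 4.0, 8.0]
print(sum(delays))  # 14.0 seconds of waiting if every retry is used
```

&lt;p&gt;With these settings, a payment validation that keeps failing is invoked up to four times (the original attempt plus three retries) before the state fails.&lt;/p&gt;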
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/23/BDB-4711-2.png"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 2: A complex workflow in AWS Step Functions, involving multiple stages of data processing. The parallel execution doesn’t allow resuming from a specific mid-execution step, but the branching structure provides automated error handling and recovery.&lt;/em&gt;&lt;/p&gt; 
&lt;h4&gt;Native AWS service integration&lt;/h4&gt; 
&lt;p&gt;Step Functions provides direct integration with Lambda functions, SQS queues, SNS topics, and DynamoDB, eliminating the need for custom connectors or additional infrastructure components.&lt;/p&gt; 
&lt;h4&gt;Cost-effective pay-per-use model&lt;/h4&gt; 
&lt;p&gt;The pay-per-execution pricing model aligns with variable order volumes, keeping costs minimal during slow periods while scaling automatically during busy times.&lt;/p&gt; 
&lt;h4&gt;Human approval workflow support&lt;/h4&gt; 
&lt;p&gt;Step Functions supports human approval steps, making it suitable for high-value order workflows that require manual review or approval processes.&lt;/p&gt; 
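&lt;p&gt;One common way to implement an approval step is the wait-for-callback service integration, in which the workflow sends a task token to a reviewer channel and pauses until the token is returned. The following sketch builds such a state as a Python dictionary and prints it as JSON. The state name, queue URL, account ID, &lt;code&gt;orderId&lt;/code&gt; input field, and one-day timeout are hypothetical values chosen for illustration and are not part of the workflow shown earlier.&lt;/p&gt;

```python
import json

# Hypothetical approval state; the workflow pauses here until the task token is returned
approval_state = {
    "WaitForApproval": {
        "Type": "Task",
        # The .waitForTaskToken suffix tells Step Functions to wait for a callback
        "Resource": "arn:aws:states:::sqs:sendMessage.waitForTaskToken",
        "Parameters": {
            "QueueUrl": "https://sqs.us-east-1.amazonaws.com/123456789012/order-approvals",
            "MessageBody": {
                "orderId.$": "$.orderId",        # assumed field in the state input
                "taskToken.$": "$$.Task.Token",  # token the approver must send back
            },
        },
        "TimeoutSeconds": 86400,  # fail the state if nobody responds within one day
        "Next": "ProcessOrder",
    }
}

print(json.dumps(approval_state, indent=2))
```

&lt;p&gt;After a reviewer decides, the approving application returns the token with the Step Functions &lt;code&gt;SendTaskSuccess&lt;/code&gt; or &lt;code&gt;SendTaskFailure&lt;/code&gt; API (&lt;code&gt;send_task_success&lt;/code&gt; and &lt;code&gt;send_task_failure&lt;/code&gt; in Boto3), and the workflow resumes or fails accordingly.&lt;/p&gt;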
&lt;h3&gt;Implementation considerations&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Error handling&lt;/strong&gt;: Built-in retry mechanisms and error handling patterns help provide reliable order processing with configurable retry policies.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Visual monitoring&lt;/strong&gt;: The Step Functions console provides real-time visibility into order processing status, enabling quick identification of bottlenecks.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Security model&lt;/strong&gt;: You can assign fine-grained &lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener noreferrer"&gt;AWS Identity and Access Management&lt;/a&gt; (IAM) roles to the resources behind each step, so that payment processing functions have different permissions than inventory management functions.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Choosing the right workflow orchestration service&lt;/h2&gt; 
&lt;p&gt;When selecting between Amazon MWAA and AWS Step Functions, consider these workflow characteristics:&lt;/p&gt; 
&lt;h3&gt;Consider Amazon MWAA when your use case involves:&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Complex stateful data processing&lt;/strong&gt; where workflows modify data state and require recovery mechanisms to maintain consistency&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Long-running batch jobs&lt;/strong&gt; executing for hours or days where computational investment is substantial&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Built-in scheduling requirements&lt;/strong&gt; where regular batch processing needs time-based orchestration&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Granular recovery needs&lt;/strong&gt; where resuming from specific failure points is business-critical&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Complex task dependencies&lt;/strong&gt; involving sophisticated relationships between workflow tasks&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Existing Apache Airflow expertise&lt;/strong&gt; where teams have substantial investment in Apache Airflow knowledge&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Consider AWS Step Functions when your use case involves:&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Event-driven serverless workflows&lt;/strong&gt; triggered by external events requiring immediate response&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Stateless processing&lt;/strong&gt; where each workflow execution operates independently&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Short to medium duration tasks&lt;/strong&gt; completing within minutes to hours&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Heavy AWS service integration&lt;/strong&gt; involving extensive coordination with Lambda functions and other AWS services&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Human approval workflows&lt;/strong&gt; requiring manual intervention or decision-making&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Variable load patterns&lt;/strong&gt; with unpredictable traffic requiring automatic scaling&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Decision framework&lt;/h2&gt; 
&lt;p&gt;To help guide your decision process, consider the following questions:&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/23/BDB-4711-3.png"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 3: Decision tree guiding through key considerations for choosing between Amazon MWAA and AWS Step Functions based on workflow characteristics.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/23/BDB-4711-4.png"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 4: Comprehensive comparison between Amazon MWAA and AWS Step Functions, highlighting decision factors for choosing the right workflow orchestration service.&lt;/em&gt;&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;Both Amazon Managed Workflows for Apache Airflow and AWS Step Functions are workflow orchestration services, each designed to address specific use case requirements. By understanding your workflow characteristics and aligning them with the strengths of each service, you can make an informed decision that supports your business needs.&lt;/p&gt; 
&lt;p&gt;For complex, stateful workflows with long execution times and sophisticated recovery requirements, Amazon MWAA provides robust capabilities. For event-driven, serverless workflows with tight AWS integration and variable load patterns, AWS Step Functions is a strong fit.&lt;/p&gt; 
&lt;p&gt;Remember that these services are not mutually exclusive. Many organizations use both to address different workflow orchestration needs across their application portfolio. By focusing on your specific use case requirements, you can select the right tool for each job and build resilient, efficient workflow orchestration solutions on AWS.&lt;/p&gt; 
&lt;p&gt;If you have questions or feedback about choosing between these services, leave a comment.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/23/BDB-4711-5.jpeg" alt="Rajkumar Raghuwanshi" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Rajkumar Raghuwanshi&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/rajkumarraghuwanshi/" target="_blank" rel="noopener"&gt;Rajkumar&lt;/a&gt; is a Delivery Consultant, within AWS Professional Services, specializing in helping customers design and optimize their&amp;nbsp;data and analytics workloads on AWS. With expertise spanning&amp;nbsp;database modernization, data migration, and analytics architecture, he builds scalable, cloud-native solutions that enable customers to unlock the full value of their data.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/23/BDB-4711-6.jpeg" alt="Shuvajit Ghosh" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Shuvajit Ghosh&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/shuvajitg/" target="_blank" rel="noopener"&gt;Shuvajit&lt;/a&gt; is a Delivery Consultant – Data &amp;amp; Analytics within AWS Professional Services, with over a decade of experience architecting enterprise-scale data warehouses, lakehouse platforms, and modern data ecosystems. He specializes in data lakehouse architectures, end-to-end ETL/ELT pipeline design, data lineage, and container-based solutions using services like Amazon Redshift, Amazon OpenSearch Service, AWS Glue, Lake Formation, Apache Iceberg, dbt, and Amazon MWAA.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/23/BDB-4711-7.jpeg" alt="Nishad" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Nishad Mankar&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/nishadmankar/" target="_blank" rel="noopener"&gt;Nishad&lt;/a&gt; is a Delivery Consultant with AWS Professional Services, passionate about helping customers harness the power of&amp;nbsp;data on the cloud. He brings deep expertise in&amp;nbsp;analytics architecture, data platform modernization, and database migration, enabling organizations to build robust, scalable solutions on AWS. From architecting&amp;nbsp;modern data pipelines&amp;nbsp;to optimizing complex workloads, Nishad partners closely with customers to accelerate their cloud journey and deliver measurable business outcomes.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>Real-time CDC from Aurora PostgreSQL to Amazon S3 Tables using Debezium and Firehose</title>
		<link>https://aws.amazon.com/blogs/big-data/real-time-cdc-from-aurora-postgresql-to-amazon-s3-tables-using-debezium-and-firehose/</link>
		
		<dc:creator><![CDATA[Chintan Agrawal]]></dc:creator>
		<pubDate>Wed, 10 Jun 2026 15:29:26 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon Data Firehose]]></category>
		<category><![CDATA[Amazon S3 Tables]]></category>
		<category><![CDATA[AWS Database Migration Service]]></category>
		<category><![CDATA[AWS Lambda]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">822a3941445672e58662ee6ac44667d64e30b4ed</guid>

					<description>In this post, we show you how to build a CDC pipeline that delivers query-ready Iceberg tables directly. The pipeline captures inserts, updates, and deletes from Aurora PostgreSQL and applies them as row-level operations in Amazon S3 Tables, a capability of Amazon Simple Storage Service (Amazon S3).</description>
										<content:encoded>&lt;p&gt;Enterprises running transactional workloads on &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.AuroraPostgreSQL.html" target="_blank" rel="noopener"&gt;Amazon Aurora PostgreSQL-Compatible Edition&lt;/a&gt; (Aurora PostgreSQL) need their operational data available for analytics. However, analytical queries and cross-database joins compete for resources on OLTP-optimized clusters. Batch exports introduce latency, and when data spans multiple Aurora clusters, there’s no straightforward way to join datasets or run cross-domain analytics. Real-time change data capture (CDC) addresses this by streaming row-level changes into a separate analytics layer. However, most CDC approaches write append-only records that require downstream consumers to reconstruct current state from the change log.&lt;/p&gt; 
&lt;p&gt;In this post, we show you how to build a CDC pipeline that delivers query-ready Iceberg tables directly. The pipeline captures inserts, updates, and deletes from Aurora PostgreSQL and applies them as row-level operations in &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables.html" target="_blank" rel="noopener"&gt;Amazon S3 Tables&lt;/a&gt;, a capability of &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html" target="_blank" rel="noopener"&gt;Amazon Simple Storage Service&lt;/a&gt; (Amazon S3). The destination tables always reflect the current state of the source database. You use &lt;a href="https://debezium.io/" target="_blank" rel="noopener"&gt;Debezium&lt;/a&gt; on &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/msk-connect.html" target="_blank" rel="noopener"&gt;Amazon MSK Connect&lt;/a&gt; for change capture and &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/what-is-msk.html" target="_blank" rel="noopener"&gt;Amazon Managed Streaming for Apache Kafka&lt;/a&gt; (Amazon MSK) for streaming. You also use &lt;a href="https://docs.aws.amazon.com/lambda/latest/dg/welcome.html" target="_blank" rel="noopener"&gt;AWS Lambda&lt;/a&gt; to transform CDC events and resolve operation semantics, and &lt;a href="https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html" target="_blank" rel="noopener"&gt;Amazon Data Firehose&lt;/a&gt; to deliver records into Iceberg tables. You deploy the infrastructure using the &lt;a href="https://docs.aws.amazon.com/cdk/v2/guide/home.html" target="_blank" rel="noopener"&gt;AWS Cloud Development Kit&lt;/a&gt; (AWS CDK).&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://iceberg.apache.org/" target="_blank" rel="noopener"&gt;Apache Iceberg&lt;/a&gt; supports row-level updates, deletes, ACID transactions, schema evolution, and time travel natively. S3 Tables handles Iceberg snapshot management and compaction automatically. With &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html" target="_blank" rel="noopener"&gt;AWS Lake Formation&lt;/a&gt; for access control, multiple teams can query the tables through &lt;a href="https://docs.aws.amazon.com/athena/latest/ug/what-is.html" target="_blank" rel="noopener"&gt;Amazon Athena&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/welcome.html" target="_blank" rel="noopener"&gt;Amazon Redshift&lt;/a&gt;, or &lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/what-is-sagemaker-unified-studio.html" target="_blank" rel="noopener"&gt;Amazon SageMaker Unified Studio&lt;/a&gt;.&lt;/p&gt; 
&lt;h2 id="solution-overview"&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;The following diagram shows the architecture of the CDC pipeline.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/05/BDB-5821-1.png" alt="Figure 1. CDC pipeline architecture from Aurora PostgreSQL to Amazon S3 Tables." width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 1. CDC pipeline architecture from Aurora PostgreSQL to Amazon S3 Tables.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;The pipeline has six stages:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;&lt;strong&gt;Aurora PostgreSQL to Debezium.&lt;/strong&gt; Debezium runs on MSK Connect in your VPC and uses PostgreSQL’s native &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Replication.Logical.html" target="_blank" rel="noopener"&gt;logical replication&lt;/a&gt; to stream row-level changes from the write-ahead log (WAL), with minimal impact on query performance.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Debezium to Amazon MSK.&lt;/strong&gt; The &lt;code&gt;ByLogicalTableRouter&lt;/code&gt; &lt;a href="https://debezium.io/documentation/reference/stable/transformations/topic-routing.html" target="_blank" rel="noopener"&gt;SMT&lt;/a&gt; reroutes CDC events from multiple tables into a single topic (&lt;code&gt;aurora.cdc.all-tables&lt;/code&gt;), retaining the source table name in each message.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon MSK to Firehose.&lt;/strong&gt; Firehose connects to the MSK cluster using &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/iam-access-control.html" target="_blank" rel="noopener"&gt;IAM access control&lt;/a&gt; over &lt;a href="https://docs.aws.amazon.com/vpc/latest/privatelink/what-is-privatelink.html" target="_blank" rel="noopener"&gt;AWS PrivateLink&lt;/a&gt; and continuously polls the topic for new messages.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Firehose to Lambda.&lt;/strong&gt; For each batch, Firehose invokes the Lambda function to decode the Kafka message, flatten the Debezium envelope, and set &lt;a href="https://docs.aws.amazon.com/firehose/latest/dev/apache-iceberg-format-input-record-different.html" target="_blank" rel="noopener"&gt;otfMetadata&lt;/a&gt; routing with the destination table and operation type.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Firehose to S3 Tables.&lt;/strong&gt; Firehose reads the &lt;code&gt;otfMetadata&lt;/code&gt;, routes each record to the correct Iceberg table, and performs the appropriate row-level operation using configured unique keys (for example, &lt;code&gt;order_id&lt;/code&gt; for orders). S3 Tables handles &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-maintenance-compaction.html" target="_blank" rel="noopener"&gt;compaction&lt;/a&gt; and snapshot management automatically.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Query and access control.&lt;/strong&gt; After data lands in S3 Tables, you can query the Iceberg tables with &lt;a href="https://docs.aws.amazon.com/athena/latest/ug/what-is.html"&gt;Amazon Athena&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/welcome.html"&gt;Amazon Redshift&lt;/a&gt;, or &lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/what-is-sagemaker-unified-studio.html"&gt;Amazon SageMaker Unified Studio&lt;/a&gt;, with &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html"&gt;AWS Lake Formation&lt;/a&gt; managing fine-grained access control.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Firehose supports one MSK topic per delivery stream. The single-topic routing pattern works within this limit: a Debezium SMT consolidates multiple tables into one topic, and a Lambda function routes each record to the correct destination table. As a result, one Firehose stream can serve multiple tables, which reduces cost and operational complexity.&lt;/p&gt; 
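&lt;p&gt;To make the single-topic pattern concrete, the following Python sketch mimics how a regex-based reroute collapses per-table topics into one. It is an illustration only, not the connector's code. The per-table topic names follow Debezium's &lt;code&gt;prefix.schema.table&lt;/code&gt; convention with the &lt;code&gt;aurora.cdc&lt;/code&gt; prefix used in this post. The regular expression and the &lt;code&gt;reroute&lt;/code&gt; helper are assumptions made for this example.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;import re

# Assumed pattern: matches per-table topics such as aurora.cdc.public.orders.
TOPIC_REGEX = re.compile(r"aurora\.cdc\.public\.(.+)")
# Consolidated topic named in this post.
TARGET_TOPIC = "aurora.cdc.all-tables"


def reroute(topic):
    """Return the consolidated topic for matching names, else the original name."""
    if TOPIC_REGEX.fullmatch(topic):
        return TARGET_TOPIC
    return topic


if __name__ == "__main__":
    for name in ["aurora.cdc.public.orders", "aurora.cdc.public.products", "other.topic"]:
        print(name, "routes to", reroute(name))&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Because every table lands in the same topic, the table name has to travel inside the message itself. The Lambda function relies on this when it reads the &lt;code&gt;source.table&lt;/code&gt; field.&lt;/p&gt; 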
&lt;h2 id="debezium-event-transformation"&gt;Debezium event transformation&lt;/h2&gt; 
&lt;p&gt;Debezium produces CDC events in an &lt;a href="https://debezium.io/documentation/reference/stable/connectors/postgresql.html#postgresql-events" target="_blank" rel="noopener"&gt;envelope structure&lt;/a&gt; containing both the previous and current state of a row, along with metadata about the source database, table, and operation type. However, Firehose expects records in a flattened JSON format with routing metadata that indicates the target table and operation type.&lt;/p&gt; 
&lt;p&gt;The Lambda function bridges this gap by performing three operations on each record:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;&lt;strong&gt;Decode.&lt;/strong&gt; When Firehose uses Amazon MSK as a source, it delivers the Kafka message value as a base64-encoded string in the &lt;code&gt;kafkaRecordValue&lt;/code&gt; field. The function base64-decodes this field to obtain the raw Debezium JSON payload.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Flatten and extract.&lt;/strong&gt; The function pulls the row data from the Debezium envelope. For inserts and updates, it uses the &lt;code&gt;after&lt;/code&gt; field (the row after the change). For deletes, it uses the &lt;code&gt;before&lt;/code&gt; field, because the &lt;code&gt;after&lt;/code&gt; field is null when a row is removed.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Route.&lt;/strong&gt; The function sets the &lt;a href="https://docs.aws.amazon.com/firehose/latest/dev/apache-iceberg-format-input-record-different.html" target="_blank" rel="noopener"&gt;otfMetadata&lt;/a&gt; block with &lt;code&gt;destinationTableName&lt;/code&gt; (extracted from the Debezium &lt;code&gt;source.table&lt;/code&gt; field) and &lt;code&gt;operation&lt;/code&gt; (mapped from Debezium’s single-character codes to Firehose’s operation types).&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;The following table shows how Debezium operation codes map to Firehose Iceberg operations:&lt;/p&gt; 
&lt;table border="1px" width="100%" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Debezium code&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Meaning&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Firehose operation&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;c&lt;/td&gt; 
   &lt;td&gt;Row created (insert)&lt;/td&gt; 
   &lt;td&gt;insert&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;u&lt;/td&gt; 
   &lt;td&gt;Row updated&lt;/td&gt; 
   &lt;td&gt;update&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;d&lt;/td&gt; 
   &lt;td&gt;Row deleted&lt;/td&gt; 
   &lt;td&gt;delete&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;r&lt;/td&gt; 
   &lt;td&gt;Snapshot read (initial load)&lt;/td&gt; 
   &lt;td&gt;insert&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;When Debezium starts with &lt;code&gt;snapshot.mode=initial&lt;/code&gt;, it reads all existing rows and emits them as &lt;code&gt;r&lt;/code&gt; (read) events. These represent rows that existed before CDC began, so they are mapped to &lt;code&gt;insert&lt;/code&gt; to establish the baseline state in the destination tables.&lt;/p&gt; 
&lt;p&gt;For example, the function transforms this Debezium envelope:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-json"&gt;{
"op": "c",
"before": null,
"after": {"order_id": 1, "customer_id": 1, "total_amount": 299.99},
"source": {"table": "orders", "db": "cdcdemo"}
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Into a response record with routing metadata:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-json"&gt;{
"recordId": "&amp;lt;original-record-id&amp;gt;",
"result": "Ok",
"kafkaRecordValue": "&amp;lt;base64-encoded flattened row JSON&amp;gt;",
"metadata": {
"otfMetadata": {
"destinationDatabaseName": "aurora_cdc",
"destinationTableName": "orders",
"operation": "insert"
}
}
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The &lt;code&gt;kafkaRecordValue&lt;/code&gt; contains the base64-encoded flattened row data (for example, &lt;code&gt;{"order_id": 1, "customer_id": 1, "total_amount": 299.99}&lt;/code&gt;), and the &lt;code&gt;otfMetadata&lt;/code&gt; block tells Firehose which table to write to and which operation to perform.&lt;/p&gt; 
&lt;p&gt;With this routing metadata, a single Firehose stream can write to multiple destination tables. For more information, see &lt;a href="https://docs.aws.amazon.com/firehose/latest/dev/apache-iceberg-format-input-record-different.html" target="_blank" rel="noopener"&gt;Route incoming records to different Iceberg tables&lt;/a&gt;.&lt;/p&gt; 
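&lt;p&gt;The decode, flatten, and route operations described in this section can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the function shipped in the sample repository. The handler and helper names, the hardcoded &lt;code&gt;aurora_cdc&lt;/code&gt; database name, and the choice to mark unknown operation codes as &lt;code&gt;ProcessingFailed&lt;/code&gt; are assumptions. Tombstone messages and other error handling are omitted.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;import base64
import json

# Debezium operation codes mapped to Firehose Iceberg operations (from the table above).
OP_MAP = {"c": "insert", "r": "insert", "u": "update", "d": "delete"}
# Assumed destination database; matches the example response in this post.
DESTINATION_DATABASE = "aurora_cdc"


def transform_record(record):
    """Decode one Firehose record, flatten the envelope, and attach routing metadata."""
    envelope = json.loads(base64.b64decode(record["kafkaRecordValue"]))
    operation = OP_MAP.get(envelope.get("op"))
    if operation is None:
        # Assumption: unknown codes are reported as failed and returned unchanged.
        return {
            "recordId": record["recordId"],
            "result": "ProcessingFailed",
            "kafkaRecordValue": record["kafkaRecordValue"],
        }
    # Deletes carry the row in "before"; other operations carry it in "after".
    row = envelope["before"] if operation == "delete" else envelope["after"]
    payload = base64.b64encode(json.dumps(row).encode("utf-8")).decode("ascii")
    return {
        "recordId": record["recordId"],
        "result": "Ok",
        "kafkaRecordValue": payload,
        "metadata": {
            "otfMetadata": {
                "destinationDatabaseName": DESTINATION_DATABASE,
                "destinationTableName": envelope["source"]["table"],
                "operation": operation,
            }
        },
    }


def lambda_handler(event, context):
    """Entry point: transform every record in the batch."""
    return {"records": [transform_record(r) for r in event["records"]]}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The sketch assumes the envelope sits at the top level of the message, which holds when the converters run with schemas disabled. Note that with PostgreSQL's default replica identity, the &lt;code&gt;before&lt;/code&gt; image of a delete contains only the primary key columns. That is enough for a key-based delete.&lt;/p&gt; 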
&lt;h2 id="walkthrough"&gt;Walkthrough&lt;/h2&gt; 
&lt;p&gt;The following sections walk you through building the CDC pipeline end to end.&lt;/p&gt; 
&lt;h3 id="prerequisites"&gt;Prerequisites&lt;/h3&gt; 
&lt;p&gt;Before you begin, make sure you have the following:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;An &lt;a href="https://aws.amazon.com/premiumsupport/knowledge-center/create-and-activate-aws-account/" target="_blank" rel="noopener"&gt;AWS account&lt;/a&gt; with permissions to create the resources described in this post.&lt;/li&gt; 
 &lt;li&gt;An existing &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html" target="_blank" rel="noopener"&gt;Amazon Virtual Private Cloud&lt;/a&gt; (Amazon VPC) with at least two subnets in different Availability Zones.&lt;/li&gt; 
 &lt;li&gt;An Aurora PostgreSQL cluster in the same VPC with &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Replication.Logical.html" target="_blank" rel="noopener"&gt;logical replication enabled&lt;/a&gt; (&lt;code&gt;rds.logical_replication = 1&lt;/code&gt;).&lt;/li&gt; 
 &lt;li&gt;Aurora database credentials stored in &lt;a href="https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html" target="_blank" rel="noopener"&gt;AWS Secrets Manager&lt;/a&gt;. Note the secret ARN for the CDK configuration.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html" target="_blank" rel="noopener"&gt;AWS CDK&lt;/a&gt; v2 installed (&lt;code&gt;npm install -g aws-cdk&lt;/code&gt;).&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://nodejs.org/" target="_blank" rel="noopener"&gt;Node.js&lt;/a&gt; 18+ and npm.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html" target="_blank" rel="noopener"&gt;AWS Command Line Interface&lt;/a&gt; (AWS CLI) v2 installed and configured with appropriate credentials.&lt;/li&gt; 
 &lt;li&gt;An &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/Welcome.html" target="_blank" rel="noopener"&gt;Amazon S3&lt;/a&gt; general purpose bucket for the Debezium plugin upload and Firehose failed record backup.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-aws.html" target="_blank" rel="noopener"&gt;S3 Tables integration with AWS analytics services&lt;/a&gt; enabled in your AWS Region (one-time setup).&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3 id="step-1-enable-cdc-in-aurora-postgresql"&gt;Step 1: Enable CDC in Aurora PostgreSQL&lt;/h3&gt; 
&lt;p&gt;PostgreSQL supports change data capture through its &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/AuroraPostgreSQL.Replication.Logical.html" target="_blank" rel="noopener"&gt;logical replication&lt;/a&gt; framework, which allows database changes to be streamed from the write-ahead log (WAL). Debezium uses this mechanism to continuously read row-level changes and publish them to Kafka topics.&lt;/p&gt; 
&lt;p&gt;To enable logical replication in Aurora PostgreSQL, configure a custom &lt;a href="https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/USER_WorkingWithDBClusterParamGroups.html" target="_blank" rel="noopener"&gt;DB cluster parameter group&lt;/a&gt;:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;Create a custom parameter group and set the following parameter: &lt;code&gt;rds.logical_replication = 1&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;Apply the parameter group to your Aurora cluster and reboot the cluster for the change to take effect.&lt;/li&gt; 
 &lt;li&gt;Connect to your Aurora PostgreSQL cluster and create the source tables:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;CREATE TABLE public.orders (
    order_id SERIAL PRIMARY KEY,
    customer_id INTEGER,
    order_date VARCHAR(50),
    total_amount DECIMAL(12,2),
    status VARCHAR(50),
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);
CREATE TABLE public.products (
    product_id SERIAL PRIMARY KEY,
    product_name VARCHAR(255),
    category VARCHAR(100),
    price DECIMAL(10,2),
    stock_quantity INTEGER,
    created_at TIMESTAMP DEFAULT NOW(),
    updated_at TIMESTAMP DEFAULT NOW()
);&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="4" type="1"&gt; 
 &lt;li&gt;Create a &lt;a href="https://www.postgresql.org/docs/current/logical-replication-publication.html" target="_blank" rel="noopener"&gt;publication&lt;/a&gt; that defines which tables are included in the change stream. Debezium automatically creates the &lt;a href="https://www.postgresql.org/docs/current/logicaldecoding-explanation.html#LOGICALDECODING-REPLICATION-SLOTS" target="_blank" rel="noopener"&gt;logical replication slot&lt;/a&gt; when the connector starts for the first time, so you don’t need to create one manually.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;CREATE PUBLICATION dbz_publication FOR TABLE public.orders, public.products;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="5" type="1"&gt; 
 &lt;li&gt;Verify the publication was created:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;SELECT * FROM pg_publication WHERE pubname = 'dbz_publication';&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;You should see one row returned, confirming the publication is active.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; When the Debezium connector starts (Step 6), it creates a replication slot named &lt;code&gt;debezium_slot&lt;/code&gt;. This slot retains WAL segments until consumed. If the connector is stopped for an extended period, WAL segments can accumulate and increase storage usage on the Aurora cluster. Monitor the &lt;code&gt;ReplicationSlotDiskUsage&lt;/code&gt; &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html" target="_blank" rel="noopener"&gt;Amazon CloudWatch&lt;/a&gt; metric for your Aurora cluster.&lt;/p&gt; 
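&lt;p&gt;If you also want to check retained WAL from inside the database, one common approach is to compare the slot's &lt;code&gt;restart_lsn&lt;/code&gt; from the &lt;code&gt;pg_replication_slots&lt;/code&gt; view with the value of &lt;code&gt;pg_current_wal_lsn()&lt;/code&gt;. The following Python sketch shows the arithmetic on two log sequence numbers (LSNs). The helper names and sample values are hypothetical, and the result is an approximation of what the slot holds back rather than an exact storage figure.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;def lsn_to_int(lsn):
    """Convert a PostgreSQL LSN such as '0/16B3748' to a byte position."""
    high, low = lsn.split("/")
    # An LSN is two hexadecimal numbers: the high and low 32 bits of a 64-bit position.
    return int(high, 16) * 2**32 + int(low, 16)


def retained_wal_bytes(current_lsn, restart_lsn):
    """Bytes of WAL between the slot's restart_lsn and the current write position."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


if __name__ == "__main__":
    # Example values only; real ones come from pg_current_wal_lsn() and pg_replication_slots.
    print(retained_wal_bytes("0/3000000", "0/1000000"))&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;A value that keeps growing while the connector is stopped is the signal to either restart the connector or drop the slot.&lt;/p&gt; 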
&lt;h3 id="step-2-build-and-register-the-debezium-plugin"&gt;Step 2: Build and register the Debezium plugin&lt;/h3&gt; 
&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/msk-connect.html" target="_blank" rel="noopener"&gt;MSK Connect&lt;/a&gt; runs connectors using &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/msk-connect-plugins.html" target="_blank" rel="noopener"&gt;custom plugins&lt;/a&gt; that you upload to Amazon S3. In this step, you download the Debezium PostgreSQL connector, package it as a ZIP file, upload it to S3, and register it with MSK Connect.&lt;/p&gt; 
&lt;p&gt;First, create an S3 bucket for the plugin, or reuse the general purpose bucket from the prerequisites:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws s3 mb s3://&amp;lt;your-plugin-bucket&amp;gt; --region &amp;lt;your-region&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Download and package the Debezium connector:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;DEBEZIUM_VERSION=2.7.3.Final
curl -LO "https://repo1.maven.org/maven2/io/debezium/debezium-connector-postgres/${DEBEZIUM_VERSION}/debezium-connector-postgres-${DEBEZIUM_VERSION}-plugin.tar.gz"
mkdir -p debezium-plugin
tar -xzf debezium-connector-postgres-${DEBEZIUM_VERSION}-plugin.tar.gz -C debezium-plugin/
cd debezium-plugin &amp;amp;&amp;amp; zip -r ../debezium-postgres-connector.zip . &amp;amp;&amp;amp; cd ..
aws s3 cp debezium-postgres-connector.zip s3://&amp;lt;your-plugin-bucket&amp;gt;/plugins/&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Register the plugin with MSK Connect:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws kafkaconnect create-custom-plugin \
    --custom-plugin-name debezium-postgres-connector \
    --content-type ZIP \
    --location "s3Location={bucketArn=arn:aws:s3:::&amp;lt;your-plugin-bucket&amp;gt;,fileKey=plugins/debezium-postgres-connector.zip}"&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Create a &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/msk-connect-workers.html" target="_blank" rel="noopener"&gt;worker configuration&lt;/a&gt; that tells MSK Connect to serialize Kafka messages as JSON without schemas:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws kafkaconnect create-worker-configuration \
    --name debezium-worker-config \
    --properties-file-content "$(echo -n 'key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
key.converter.schemas.enable=false
value.converter.schemas.enable=false' | base64 | tr -d '\n')"&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
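&lt;p&gt;The &lt;code&gt;--properties-file-content&lt;/code&gt; value is the properties text encoded as base64. If you prefer to generate or inspect that value outside the shell, the following Python sketch produces it as a single line. GNU &lt;code&gt;base64&lt;/code&gt; wraps its output at 76 characters by default, whereas Python's &lt;code&gt;base64.b64encode&lt;/code&gt; does not insert line breaks. The &lt;code&gt;encode_properties&lt;/code&gt; helper name is hypothetical.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;import base64

# The same four worker properties used in the CLI command above.
WORKER_PROPERTIES = "\n".join([
    "key.converter=org.apache.kafka.connect.json.JsonConverter",
    "value.converter=org.apache.kafka.connect.json.JsonConverter",
    "key.converter.schemas.enable=false",
    "value.converter.schemas.enable=false",
])


def encode_properties(text):
    """Return the properties text as single-line base64."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    print(encode_properties(WORKER_PROPERTIES))&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Decoding the printed value with &lt;code&gt;base64.b64decode&lt;/code&gt; should return the original four lines, which is a quick way to confirm what MSK Connect receives.&lt;/p&gt; 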
&lt;p&gt;Note the &lt;code&gt;customPluginArn&lt;/code&gt; and &lt;code&gt;workerConfigurationArn&lt;/code&gt; from the output. You need these for the CDK configuration in the next step.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; The custom plugin and worker configuration are created through the AWS CLI because the Debezium connector JARs must be downloaded from the &lt;a href="https://debezium.io/releases/" target="_blank" rel="noopener"&gt;Debezium project&lt;/a&gt; and packaged manually. The remaining infrastructure is deployed using the AWS CDK in the following steps.&lt;/p&gt; 
&lt;h3 id="step-3-configure-the-cdk-project"&gt;Step 3: Configure the CDK project&lt;/h3&gt; 
&lt;p&gt;Clone the sample repository and install dependencies:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;git clone https://github.com/aws-samples/sample-aurora-cdc-s3tables.git
cd sample-aurora-cdc-s3tables/cdk
npm install&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Open &lt;code&gt;cdk/lib/v2/config.ts&lt;/code&gt; and update the configuration values to match your environment:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-typescript"&gt;export const CONFIG = {
account: '&amp;lt;your-account-id&amp;gt;',
region: '&amp;lt;your-region&amp;gt;',
// VPC - must match your Aurora cluster's VPC
vpcId: '&amp;lt;your-vpc-id&amp;gt;',
subnetIds: ['&amp;lt;subnet-1&amp;gt;', '&amp;lt;subnet-2&amp;gt;'],
auroraSecurityGroupId: '&amp;lt;aurora-security-group-id&amp;gt;',
// Aurora connection details
auroraEndpoint: '&amp;lt;aurora-cluster-endpoint&amp;gt;',
auroraPort: '5432',
auroraDbName: '&amp;lt;database-name&amp;gt;',
auroraUser: '&amp;lt;db-user&amp;gt;',
auroraSecretArn: '&amp;lt;secrets-manager-arn&amp;gt;',
// Debezium - use the ARNs from Step 2
debeziumPluginArn: '&amp;lt;customPluginArn-from-step-2&amp;gt;',
debeziumWorkerConfigArn: '&amp;lt;workerConfigurationArn-from-step-2&amp;gt;',
debeziumPluginBucket: '&amp;lt;your-plugin-bucket-name&amp;gt;',
debeziumTopicPrefix: 'aurora.cdc',
debeziumTables: 'public.orders,public.products',
// S3 Tables - the table bucket name must be unique within your account and Region
s3TablesBucketName: '&amp;lt;your-table-bucket-name&amp;gt;',
s3TablesNamespace: 'aurora_cdc',
tables: ['orders', 'products'],
tableKeys: { orders: 'order_id', products: 'product_id' },
// Firehose - general purpose S3 bucket for failed record backup
firehoseBackupBucket: '&amp;lt;your-backup-bucket-name&amp;gt;',
};&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Key configuration notes:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;auroraSecurityGroupId.&lt;/strong&gt; The &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/vpc-security-groups.html" target="_blank" rel="noopener"&gt;security group&lt;/a&gt; attached to your Aurora cluster. The CDK creates an MSK security group with ingress rules allowing traffic from this security group, and a reverse rule allowing MSK Connect workers to reach Aurora on port 5432.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;tableKeys.&lt;/strong&gt; The primary key column for each table. Firehose uses these to match incoming records against existing rows for &lt;a href="https://docs.aws.amazon.com/firehose/latest/dev/apache-iceberg-destination.html" target="_blank" rel="noopener"&gt;update and delete operations&lt;/a&gt; in the Iceberg tables.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;s3TablesBucketName.&lt;/strong&gt; The name for your S3 table bucket. Table bucket names must be &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-buckets-naming.html" target="_blank" rel="noopener"&gt;unique for your account in the chosen Region&lt;/a&gt;.&lt;/li&gt; 
&lt;/ul&gt; 
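&lt;p&gt;To see why the unique key matters, consider this simplified Python model of key-based row operations. It is a conceptual sketch, not how Firehose or Iceberg is implemented: an in-memory dictionary stands in for the table, and the &lt;code&gt;apply_change&lt;/code&gt; helper and sample rows are hypothetical.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;def apply_change(table, key_column, operation, row):
    """Apply one insert, update, or delete to an in-memory table keyed by key_column."""
    key = row[key_column]
    if operation == "delete":
        # Only the key is needed to find and remove the existing row.
        table.pop(key, None)
    elif operation in ("insert", "update"):
        # The incoming row replaces any existing row with the same key.
        table[key] = row
    else:
        raise ValueError("unsupported operation: " + operation)
    return table


if __name__ == "__main__":
    orders = {}
    apply_change(orders, "order_id", "insert", {"order_id": 1, "status": "NEW"})
    apply_change(orders, "order_id", "update", {"order_id": 1, "status": "SHIPPED"})
    print(orders)
    apply_change(orders, "order_id", "delete", {"order_id": 1})
    print(orders)&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Without a key to match on, an update or delete could not be tied to a specific existing row. That is why each table in &lt;code&gt;tableKeys&lt;/code&gt; needs a key column.&lt;/p&gt; 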
&lt;h3 id="step-4-deploy-the-cdk-stacks"&gt;Step 4: Deploy the CDK stacks&lt;/h3&gt; 
&lt;p&gt;Deploy all six stacks with a single command. The CDK resolves the dependency order automatically:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;npx cdk --app "npx ts-node bin/app-v2.ts" deploy --all&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;When prompted, review the &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/introduction.html" target="_blank" rel="noopener"&gt;AWS Identity and Access Management (IAM)&lt;/a&gt; changes and confirm the deployment. The CDK deploys the following stacks:&lt;/p&gt; 
&lt;table border="1px" width="100%" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Stack&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;What it creates&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;code&gt;CdcMskCluster&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;Amazon MSK cluster (2x kafka.m5.large brokers) with dual authentication (&lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/iam-access-control.html" target="_blank" rel="noopener"&gt;IAM&lt;/a&gt; for Firehose, unauthenticated for Debezium), custom configuration with &lt;code&gt;auto.create.topics.enable=true&lt;/code&gt;, security groups with ingress rules for Aurora and MSK Connect workers&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;code&gt;CdcMskConnectIam&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;MSK Connect service execution role with permissions for Kafka cluster operations, VPC networking, S3 plugin access, and &lt;a href="https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html" target="_blank" rel="noopener"&gt;AWS Secrets Manager&lt;/a&gt;; &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/WhatIsCloudWatchLogs.html" target="_blank" rel="noopener"&gt;Amazon CloudWatch Logs&lt;/a&gt; group for connector logs&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;code&gt;CdcS3Tables&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;S3 table bucket, &lt;code&gt;aurora_cdc&lt;/code&gt; namespace, two Iceberg tables (&lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;products&lt;/code&gt;) with column schemas&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;code&gt;CdcLambdaTransform&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;Lambda function for CDC event transformation and multi-table routing&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;code&gt;CdcFirehoseRole&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;Firehose IAM role with permissions for Amazon MSK, S3 Tables, &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html" target="_blank" rel="noopener"&gt;AWS Glue Data Catalog&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html" target="_blank" rel="noopener"&gt;AWS Lake Formation&lt;/a&gt;, VPC networking, and Lambda invocation&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;code&gt;CdcFirehose&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;Firehose delivery stream with MSK as source (private connectivity through AWS PrivateLink), Lambda processing, Apache Iceberg Tables as destination with two table configurations, and S3 backup bucket for failed records&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;The MSK cluster takes approximately 25 minutes to create. The Debezium connector takes approximately 5 minutes after the cluster is ready. You can monitor the deployment progress in the &lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/Welcome.html" target="_blank" rel="noopener"&gt;AWS CloudFormation&lt;/a&gt; console.&lt;/p&gt; 
&lt;p&gt;After the deployment completes, you can verify the resources in the AWS console. The S3 table bucket shows the two Iceberg tables in the &lt;code&gt;aurora_cdc&lt;/code&gt; namespace.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/05/BDB-5821-2.png" alt="Figure 2. S3 table bucket showing the orders and products Iceberg tables in the aurora_cdc namespace." width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 2. S3 table bucket showing the orders and products Iceberg tables in the aurora_cdc namespace.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;The Firehose delivery stream shows the MSK source, Lambda transformation, and Apache Iceberg Tables destination.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/05/BDB-5821-3.png" alt="Figure 3. Amazon Data Firehose delivery stream with MSK source, Lambda transformation, and Apache Iceberg Tables destination." width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 3. Amazon Data Firehose delivery stream with MSK source, Lambda transformation, and Apache Iceberg Tables destination.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;The MSK cluster uses dual authentication (&lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/iam-access-control.html" target="_blank" rel="noopener"&gt;IAM&lt;/a&gt; for Firehose, unauthenticated for Debezium through &lt;code&gt;TLS_PLAINTEXT&lt;/code&gt;), &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/aws-access-mult-vpc.html" target="_blank" rel="noopener"&gt;multi-VPC private connectivity&lt;/a&gt; for Firehose PrivateLink access, and &lt;code&gt;auto.create.topics.enable=true&lt;/code&gt; so Debezium can create topics on first connect. VPC connectivity and the cluster resource policy are configured as CLI steps in Step 5.&lt;/p&gt; 
&lt;h3 id="step-5-enable-msk-vpc-connectivity-grant-lake-formation-permissions-and-apply-msk-cluster-policy"&gt;Step 5: Enable MSK VPC connectivity, grant Lake Formation permissions, and apply MSK cluster policy&lt;/h3&gt; 
&lt;p&gt;After the CDK deployment completes, enable &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/aws-access-mult-vpc.html" target="_blank" rel="noopener"&gt;multi-VPC private connectivity&lt;/a&gt; with IAM on the MSK cluster. Firehose requires this to create an &lt;a href="https://docs.aws.amazon.com/vpc/latest/privatelink/what-is-privatelink.html" target="_blank" rel="noopener"&gt;AWS PrivateLink&lt;/a&gt; endpoint to the MSK brokers. This setting can’t be configured during cluster creation and must be applied as an update, which triggers a rolling broker restart (approximately 20–30 minutes).&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# Set the cluster ARN from the CdcMskCluster stack outputs, then look up the current cluster version
MSK_ARN=&amp;lt;msk-cluster-arn&amp;gt;
CLUSTER_VERSION=$(aws kafka describe-cluster-v2 \
    --cluster-arn $MSK_ARN \
    --region &amp;lt;your-region&amp;gt; \
    --query 'ClusterInfo.CurrentVersion' --output text)
# Enable VPC connectivity with IAM
aws kafka update-connectivity \
    --cluster-arn $MSK_ARN \
    --current-version $CLUSTER_VERSION \
    --connectivity-info '{"VpcConnectivity":{"ClientAuthentication":{"Sasl":{"Iam":{"Enabled":true}}}}}' \
    --region &amp;lt;your-region&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Wait for the cluster state to return to &lt;code&gt;ACTIVE&lt;/code&gt; before proceeding:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws kafka describe-cluster-v2 \
    --cluster-arn $MSK_ARN \
    --region &amp;lt;your-region&amp;gt; \
    --query 'ClusterInfo.State'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
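&lt;p&gt;Because the update can take 20–30 minutes, you might prefer to poll the state instead of rerunning the command by hand. The following is a minimal sketch that is not part of the sample repository. The function name &lt;code&gt;wait_for_msk_active&lt;/code&gt;, the default of 80 attempts, and the 30-second interval are assumptions chosen for illustration; the function only wraps the &lt;code&gt;describe-cluster-v2&lt;/code&gt; call shown above.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# Hypothetical helper: poll the MSK cluster state until it is ACTIVE.
# Arguments: cluster ARN, Region, max attempts (default 80), sleep seconds (default 30).
# Returns 0 when the cluster is ACTIVE and 1 if the attempt limit is reached first.
wait_for_msk_active() {
    local arn="$1" region="$2" attempts="${3:-80}" interval="${4:-30}"
    local state i=1
    while [ "$i" -le "$attempts" ]; do
        state=$(aws kafka describe-cluster-v2 \
            --cluster-arn "$arn" \
            --region "$region" \
            --query 'ClusterInfo.State' --output text)
        echo "attempt $i of $attempts: cluster state is $state"
        if [ "$state" = "ACTIVE" ]; then
            return 0
        fi
        sleep "$interval"
        i=$((i + 1))
    done
    return 1
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;For example, &lt;code&gt;wait_for_msk_active "$MSK_ARN" &amp;lt;your-region&amp;gt;&lt;/code&gt; prints the state on each attempt and returns once the cluster is active. With the default values, the function gives up after roughly 40 minutes.&lt;/p&gt; 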
&lt;p&gt;Next, grant the Firehose IAM role permissions through &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html" target="_blank" rel="noopener"&gt;AWS Lake Formation&lt;/a&gt;. S3 Tables uses a sub-catalog format for the &lt;code&gt;CatalogId&lt;/code&gt; parameter, which differs from the standard &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html" target="_blank" rel="noopener"&gt;AWS Glue Data Catalog&lt;/a&gt;. These permissions require a &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/initial-lf-config.html#create-data-lake-admin" target="_blank" rel="noopener"&gt;data lake administrator&lt;/a&gt; identity.&lt;/p&gt; 
&lt;p&gt;Grant database-level and table-level permissions to the Firehose role:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# Grant database-level permissions
aws lakeformation grant-permissions \
    --region &amp;lt;your-region&amp;gt; \
    --principal '{"DataLakePrincipalIdentifier": "&amp;lt;firehose-role-arn&amp;gt;"}' \
    --resource '{"Database": {"CatalogId": "&amp;lt;account-id&amp;gt;:s3tablescatalog/&amp;lt;table-bucket-name&amp;gt;", "Name": "aurora_cdc"}}' \
    --permissions '["ALL"]'
# Grant table-level permissions (wildcard for the tables in the namespace)
aws lakeformation grant-permissions \
    --region &amp;lt;your-region&amp;gt; \
    --principal '{"DataLakePrincipalIdentifier": "&amp;lt;firehose-role-arn&amp;gt;"}' \
    --resource '{"Table": {"CatalogId": "&amp;lt;account-id&amp;gt;:s3tablescatalog/&amp;lt;table-bucket-name&amp;gt;", "DatabaseName": "aurora_cdc", "TableWildcard": {}}}' \
    --permissions '["ALL"]'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Note the &lt;code&gt;CatalogId&lt;/code&gt; format: &lt;code&gt;&amp;lt;account-id&amp;gt;:s3tablescatalog/&amp;lt;table-bucket-name&amp;gt;&lt;/code&gt;. This is specific to S3 Tables and tells Lake Formation to look up permissions in the S3 Tables catalog rather than the default Glue Data Catalog. For more information, see &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integrating-aws.html" target="_blank" rel="noopener"&gt;Integrating Amazon S3 Tables with AWS analytics services&lt;/a&gt;.&lt;/p&gt; 
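&lt;p&gt;Because the same &lt;code&gt;CatalogId&lt;/code&gt; string appears in each grant, it can help to build it once and reuse it. The following sketch assumes two hypothetical helper functions, &lt;code&gt;s3tables_catalog_id&lt;/code&gt; and &lt;code&gt;lf_database_resource&lt;/code&gt;, that are not part of the sample repository. They only format strings and make no AWS calls.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# Hypothetical helper: build the S3 Tables CatalogId for Lake Formation calls.
# Arguments: AWS account ID, S3 table bucket name.
s3tables_catalog_id() {
    local account_id="$1" table_bucket="$2"
    printf '%s:s3tablescatalog/%s\n' "$account_id" "$table_bucket"
}

# Hypothetical helper: build the database resource JSON passed to --resource.
# Arguments: CatalogId string, database (namespace) name.
lf_database_resource() {
    local catalog_id="$1" database="$2"
    printf '{"Database": {"CatalogId": "%s", "Name": "%s"}}\n' "$catalog_id" "$database"
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;One possible usage is to set &lt;code&gt;CATALOG_ID=$(s3tables_catalog_id "$(aws sts get-caller-identity --query Account --output text)" &amp;lt;table-bucket-name&amp;gt;)&lt;/code&gt; and then pass &lt;code&gt;"$(lf_database_resource "$CATALOG_ID" aurora_cdc)"&lt;/code&gt; as the &lt;code&gt;--resource&lt;/code&gt; value.&lt;/p&gt; 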
&lt;p&gt;Next, attach a resource-based policy to the MSK cluster that grants the Firehose service principal permission to create &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/aws-access-mult-vpc.html" target="_blank" rel="noopener"&gt;VPC connections&lt;/a&gt;:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws kafka put-cluster-policy \
    --cluster-arn &amp;lt;msk-cluster-arn&amp;gt; \
    --region &amp;lt;your-region&amp;gt; \
    --policy '{
"Version": "2012-10-17",
"Statement": [{
"Effect": "Allow",
"Principal": {"Service": "firehose.amazonaws.com"},
"Action": ["kafka:CreateVpcConnection", "kafka:GetBootstrapBrokers", "kafka:DescribeClusterV2"],
"Resource": "&amp;lt;msk-cluster-arn&amp;gt;"
}]
}'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;You can find the &lt;code&gt;&amp;lt;msk-cluster-arn&amp;gt;&lt;/code&gt; in the &lt;code&gt;CdcMskCluster&lt;/code&gt; stack outputs from Step 4, and the &lt;code&gt;&amp;lt;firehose-role-arn&amp;gt;&lt;/code&gt; in the &lt;code&gt;CdcFirehoseRole&lt;/code&gt; stack outputs.&lt;/p&gt; 
&lt;h3 id="step-6-create-the-debezium-connector"&gt;Step 6: Create the Debezium connector&lt;/h3&gt; 
&lt;p&gt;With the MSK cluster running and Lake Formation permissions in place, create the Debezium connector using the &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/msk-connect.html" target="_blank" rel="noopener"&gt;MSK Connect&lt;/a&gt; API. The connector reads changes from Aurora PostgreSQL and publishes them to the MSK topic.&lt;/p&gt; 
&lt;p&gt;Firehose supports only one MSK topic per delivery stream, so each source table would otherwise need its own Firehose stream and &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/aws-access-mult-vpc.html" target="_blank" rel="noopener"&gt;VPC connection&lt;/a&gt;. To avoid this, the connector uses the Debezium &lt;a href="https://debezium.io/documentation/reference/stable/transformations/topic-routing.html" target="_blank" rel="noopener"&gt;ByLogicalTableRouter&lt;/a&gt; Single Message Transform (SMT) to route changes from multiple tables into a single topic (&lt;code&gt;aurora.cdc.all-tables&lt;/code&gt;). The Lambda function then uses the source table name in each message to direct records to the correct Iceberg table. This single-topic pattern uses one Firehose stream for multiple tables, reducing cost and operational complexity.&lt;/p&gt; 
&lt;p&gt;First, retrieve the MSK bootstrap servers from the cluster:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws kafka get-bootstrap-brokers \
    --cluster-arn &amp;lt;msk-cluster-arn&amp;gt; \
    --region &amp;lt;your-region&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Note the &lt;code&gt;BootstrapBrokerString&lt;/code&gt; value (the &lt;code&gt;PLAINTEXT&lt;/code&gt; brokers). Then create the connector:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws kafkaconnect create-connector --cli-input-json '{
"connectorName": "aurora-postgres-debezium-connector",
"kafkaCluster": {
"apacheKafkaCluster": {
"bootstrapServers": "&amp;lt;bootstrap-servers&amp;gt;",
"vpc": {
"subnets": ["&amp;lt;subnet-1&amp;gt;", "&amp;lt;subnet-2&amp;gt;"],
"securityGroups": ["&amp;lt;msk-security-group-id&amp;gt;"]
}
}
},
"kafkaClusterClientAuthentication": {"authenticationType": "NONE"},
"kafkaClusterEncryptionInTransit": {"encryptionType": "PLAINTEXT"},
"kafkaConnectVersion": "2.7.1",
"plugins": [{"customPlugin": {"customPluginArn": "&amp;lt;custom-plugin-arn&amp;gt;", "revision": 1}}],
"serviceExecutionRoleArn": "&amp;lt;msk-connect-service-role-arn&amp;gt;",
"capacity": {"provisionedCapacity": {"mcuCount": 2, "workerCount": 2}},
"workerConfiguration": {"workerConfigurationArn": "&amp;lt;worker-config-arn&amp;gt;", "revision": 1},
"connectorConfiguration": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"tasks.max": "1",
"database.hostname": "&amp;lt;aurora-cluster-endpoint&amp;gt;",
"database.port": "5432",
"database.user": "&amp;lt;db-user&amp;gt;",
"database.password": "&amp;lt;db-password&amp;gt;",
"database.dbname": "&amp;lt;database-name&amp;gt;",
"database.server.name": "aurora_cdc",
"plugin.name": "pgoutput",
"slot.name": "debezium_slot",
"publication.name": "dbz_publication",
"table.include.list": "public.orders,public.products",
"topic.prefix": "aurora.cdc",
"schema.history.internal.kafka.topic": "schema-changes.aurora",
"schema.history.internal.kafka.bootstrap.servers": "&amp;lt;bootstrap-servers&amp;gt;",
"decimal.handling.mode": "string",
"time.precision.mode": "adaptive_time_microseconds",
"tombstones.on.delete": "false",
"snapshot.mode": "initial",
"publication.autocreate.mode": "filtered",
"transforms": "Reroute",
"transforms.Reroute.type": "io.debezium.transforms.ByLogicalTableRouter",
"transforms.Reroute.topic.regex": "aurora\\.cdc\\.public\\.(.*)",
"transforms.Reroute.topic.replacement": "aurora.cdc.all-tables"
},
"logDelivery": {
"workerLogDelivery": {
"cloudWatchLogs": {
"enabled": true,
"logGroup": "/aws/msk-connect/aurora-cdc-debezium"
}
}
}
}'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The &lt;code&gt;&amp;lt;msk-security-group-id&amp;gt;&lt;/code&gt; and &lt;code&gt;&amp;lt;msk-connect-service-role-arn&amp;gt;&lt;/code&gt; can be found in the &lt;code&gt;CdcMskCluster&lt;/code&gt; and &lt;code&gt;CdcMskConnectIam&lt;/code&gt; stack outputs respectively. The &lt;code&gt;ByLogicalTableRouter&lt;/code&gt; &lt;a href="https://debezium.io/documentation/reference/stable/transformations/topic-routing.html" target="_blank" rel="noopener"&gt;Single Message Transform&lt;/a&gt; routes CDC events from the monitored tables into a single topic (&lt;code&gt;aurora.cdc.all-tables&lt;/code&gt;).&lt;/p&gt; 
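&lt;p&gt;To see which topic names the router rewrites, you can emulate the rewrite locally. This sketch uses &lt;code&gt;sed&lt;/code&gt; rather than Debezium, so it only approximates the SMT. The function name &lt;code&gt;reroute_topic&lt;/code&gt; is hypothetical, and the &lt;code&gt;^&lt;/code&gt; and &lt;code&gt;$&lt;/code&gt; anchors assume that the router matches the full topic name. The pattern is the intended regular expression &lt;code&gt;aurora\.cdc\.public\.(.*)&lt;/code&gt;, with each dot escaped by a single backslash.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# Illustration only: emulate the topic rewrite with sed, not with Debezium.
# Topics under aurora.cdc.public are rewritten; other topics pass through unchanged.
reroute_topic() {
    printf '%s\n' "$1" | sed -E 's/^aurora\.cdc\.public\.(.*)$/aurora.cdc.all-tables/'
}

reroute_topic "aurora.cdc.public.orders"
reroute_topic "aurora.cdc.public.products"
reroute_topic "schema-changes.aurora"&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The two table topics print &lt;code&gt;aurora.cdc.all-tables&lt;/code&gt;, and &lt;code&gt;schema-changes.aurora&lt;/code&gt; is printed unchanged because it doesn’t match the pattern.&lt;/p&gt; 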
&lt;h3 id="step-7-verify-the-debezium-connector"&gt;Step 7: Verify the Debezium connector&lt;/h3&gt; 
&lt;p&gt;After creating the connector, verify that it is running and has completed its initial snapshot.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws kafkaconnect list-connectors --region &amp;lt;your-region&amp;gt; \
    --query 'connectors[?connectorName==`aurora-postgres-debezium-connector`].{Name:connectorName,State:connectorState}' \
    --output table&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The connector state should show &lt;code&gt;RUNNING&lt;/code&gt;, as shown in the following figure.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/05/BDB-5821-4.png" alt="Figure 4. Debezium connector running on Amazon MSK Connect." width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 4. Debezium connector running on Amazon MSK Connect.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;Check the CloudWatch Logs to confirm the snapshot completed:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws logs tail /aws/msk-connect/aurora-cdc-debezium --follow --region &amp;lt;your-region&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;You should see messages indicating the transition to streaming mode:&lt;/p&gt; 
&lt;pre class="text"&gt;&lt;code&gt;Finished exporting 0 records for table 'public.orders' (1 of 2 tables)
Finished exporting 0 records for table 'public.products' (2 of 2 tables)
Snapshot completed
Starting streaming&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;If the tables were empty when the connector started, the export count is 0. If you had existing data, the snapshot captures the existing rows as &lt;code&gt;r&lt;/code&gt; (read) operations, which the Lambda function maps to &lt;code&gt;insert&lt;/code&gt; operations in the Iceberg tables.&lt;/p&gt; 
&lt;p&gt;Verify that the Firehose delivery stream is active:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws firehose describe-delivery-stream \
    --delivery-stream-name msk-to-s3tables-firehose \
    --region &amp;lt;your-region&amp;gt; \
    --query 'DeliveryStreamDescription.DeliveryStreamStatus'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The status should return &lt;code&gt;ACTIVE&lt;/code&gt;.&lt;/p&gt; 
&lt;h3 id="step-8-test-the-pipeline"&gt;Step 8: Test the pipeline&lt;/h3&gt; 
&lt;p&gt;Insert test data into the Aurora PostgreSQL source tables. Each insert triggers a CDC event that flows through the pipeline: Aurora WAL to Debezium to MSK topic to Firehose to Lambda transform to S3 Tables.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;-- Insert orders
INSERT INTO public.orders (customer_id, order_date, total_amount, status)
VALUES
(1, '2026-01-20', 299.99, 'shipped'),
(2, '2026-01-21', 149.50, 'processing'),
(1, '2026-01-22', 89.99, 'delivered');
-- Insert products
INSERT INTO public.products (product_name, category, price, stock_quantity)
VALUES
('Wireless Headphones', 'Electronics', 79.99, 150),
('Running Shoes', 'Sports', 129.99, 75),
('Coffee Maker', 'Kitchen', 49.99, 200);&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;This creates six records across two tables. Each record generates a Debezium CDC event with operation type &lt;code&gt;c&lt;/code&gt; (create), which the Lambda function maps to an &lt;code&gt;insert&lt;/code&gt; operation in the corresponding Iceberg table.&lt;/p&gt; 
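&lt;p&gt;The operation mapping can be summarized as a small lookup. The following sketch is written in bash purely for illustration and is not the Lambda function’s code. The function name &lt;code&gt;map_debezium_op&lt;/code&gt; is hypothetical. The &lt;code&gt;c&lt;/code&gt; and &lt;code&gt;r&lt;/code&gt; mappings come from this walkthrough; the &lt;code&gt;u&lt;/code&gt; and &lt;code&gt;d&lt;/code&gt; mappings are assumptions based on the update and delete behavior described in this post.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# Hypothetical sketch: map a Debezium operation code to an Iceberg row-level operation.
map_debezium_op() {
    case "$1" in
        c|r) echo "insert" ;;   # create events and snapshot reads become inserts
        u)   echo "update" ;;   # assumed mapping for update events
        d)   echo "delete" ;;   # assumed mapping for delete events
        *)   return 1 ;;        # any other operation code is rejected
    esac
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;For example, &lt;code&gt;map_debezium_op c&lt;/code&gt; prints &lt;code&gt;insert&lt;/code&gt;, while an unrecognized code prints nothing and returns a nonzero status so the caller can decide how to handle it.&lt;/p&gt; 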
&lt;h3 id="step-9-verify-data-delivery"&gt;Step 9: Verify data delivery&lt;/h3&gt; 
&lt;p&gt;Check the Firehose &lt;code&gt;IncomingRecords&lt;/code&gt; metric to confirm records are flowing through the delivery stream:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws cloudwatch get-metric-statistics \
    --namespace AWS/Firehose \
    --metric-name IncomingRecords \
    --dimensions Name=DeliveryStreamName,Value=msk-to-s3tables-firehose \
    --start-time $(date -u -v-10M +%Y-%m-%dT%H:%M:%S) \
    --end-time $(date -u +%Y-%m-%dT%H:%M:%S) \
    --period 60 --statistics Sum \
    --region &amp;lt;your-region&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The command returns one datapoint per 60-second period. You should see &lt;code&gt;Sum&lt;/code&gt; values that add up to 6 or more. If no datapoints are returned or the total is 0, wait another minute and retry. There can be a short delay between MSK topic delivery and Firehose metric reporting.&lt;/p&gt; 
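&lt;p&gt;The &lt;code&gt;-v-10M&lt;/code&gt; option in the preceding command is BSD &lt;code&gt;date&lt;/code&gt; syntax (for example, on macOS); GNU &lt;code&gt;date&lt;/code&gt; on Linux uses &lt;code&gt;-d&lt;/code&gt; instead. The following sketch shows one way to handle both and to total the datapoints in a single call. The function names &lt;code&gt;minutes_ago_utc&lt;/code&gt; and &lt;code&gt;firehose_incoming_total&lt;/code&gt; are hypothetical, and choosing the &lt;code&gt;date&lt;/code&gt; syntax from &lt;code&gt;uname&lt;/code&gt; is a simplifying assumption that can be wrong if GNU coreutils is installed on macOS.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# Hypothetical helper: print a UTC timestamp N minutes in the past.
# Assumes BSD date on macOS and the BSDs, and GNU date elsewhere.
minutes_ago_utc() {
    local minutes="$1"
    case "$(uname -s)" in
        Darwin|*BSD) date -u -v-"${minutes}"M +%Y-%m-%dT%H:%M:%S ;;
        *)           date -u -d "${minutes} minutes ago" +%Y-%m-%dT%H:%M:%S ;;
    esac
}

# Hypothetical helper: total IncomingRecords over the last N minutes (default 10).
# Arguments: delivery stream name, Region, minutes.
firehose_incoming_total() {
    local stream="$1" region="$2" minutes="${3:-10}"
    aws cloudwatch get-metric-statistics \
        --namespace AWS/Firehose \
        --metric-name IncomingRecords \
        --dimensions Name=DeliveryStreamName,Value="$stream" \
        --start-time "$(minutes_ago_utc "$minutes")" \
        --end-time "$(date -u +%Y-%m-%dT%H:%M:%S)" \
        --period 60 --statistics Sum \
        --region "$region" \
        --query 'sum(Datapoints[].Sum)' --output text
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;For example, &lt;code&gt;firehose_incoming_total msk-to-s3tables-firehose &amp;lt;your-region&amp;gt;&lt;/code&gt; prints a single total for the last 10 minutes, using the JMESPath &lt;code&gt;sum()&lt;/code&gt; function in &lt;code&gt;--query&lt;/code&gt; to add the per-minute values.&lt;/p&gt; 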
&lt;p&gt;If records aren’t appearing, check the Firehose error output in the backup S3 bucket and the Lambda function’s CloudWatch Logs for transformation errors.&lt;/p&gt; 
&lt;h3 id="step-10-query-data-using-amazon-athena"&gt;Step 10: Query data using Amazon Athena&lt;/h3&gt; 
&lt;p&gt;With data delivered to S3 Tables, you can query the Iceberg tables using &lt;a href="https://docs.aws.amazon.com/athena/latest/ug/what-is.html" target="_blank" rel="noopener"&gt;Amazon Athena&lt;/a&gt;. S3 Tables integrates with the &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html" target="_blank" rel="noopener"&gt;AWS Glue Data Catalog&lt;/a&gt; as a sub-catalog, so you reference tables using the S3 Tables catalog format.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; If records aren’t appearing in Athena, check the Firehose &lt;code&gt;IncomingRecords&lt;/code&gt; CloudWatch metric and the Lambda function’s CloudWatch Logs for transformation errors.&lt;/p&gt; 
&lt;p&gt;Open the Athena console, select the &lt;strong&gt;AwsDataCatalog&lt;/strong&gt; data source, and run the following queries:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;SELECT * FROM "s3tablescatalog/&amp;lt;table-bucket-name&amp;gt;"."aurora_cdc"."products" LIMIT 10;
SELECT * FROM "s3tablescatalog/&amp;lt;table-bucket-name&amp;gt;"."aurora_cdc"."orders" LIMIT 10;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Replace &lt;code&gt;&amp;lt;table-bucket-name&amp;gt;&lt;/code&gt; with your S3 table bucket name. You should see the rows that exist in your source tables: any records that Debezium captured in its initial snapshot when the connector started, plus the records you inserted in Step 8.&lt;/p&gt; 
&lt;p&gt;The following figures show the initial state of both tables as queried through Athena. In the environment used for these figures, the products table and the orders table each contain seven records, captured during the Debezium initial snapshot. Your row counts depend on the data in your own source tables.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/05/BDB-5821-5.png" alt="Figure 5. Initial state of the products table in Amazon Athena, showing seven records captured from Aurora PostgreSQL through the CDC pipeline." width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 5. Initial state of the products table in Amazon Athena, showing seven records captured from Aurora PostgreSQL through the CDC pipeline.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/05/BDB-5821-6.png" alt="Figure 6. Initial state of the orders table in Amazon Athena, showing seven records captured from Aurora PostgreSQL through the CDC pipeline." width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 6. Initial state of the orders table in Amazon Athena, showing seven records captured from Aurora PostgreSQL through the CDC pipeline.&lt;/em&gt;&lt;/p&gt; 
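&lt;p&gt;Now test that update and delete operations propagate correctly. Run the following statements in Aurora. The &lt;code&gt;UPDATE&lt;/code&gt; and &lt;code&gt;DELETE&lt;/code&gt; statements target rows from the dataset shown in the figures (for example, Ergonomic Chair and Test Widget). If those rows don’t exist in your tables, adjust the &lt;code&gt;WHERE&lt;/code&gt; clauses to match your data:&lt;/p&gt; 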
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;-- Insert new records
INSERT INTO public.products (product_name, category, price, stock_quantity)
VALUES ('Bluetooth Speaker', 'Electronics', 129.99, 90), ('Standing Desk', 'Furniture', 799.99, 20);
INSERT INTO public.orders (customer_id, order_date, total_amount, status)
VALUES (201, '2026-04-03', 149.99, 'NEW'), (202, '2026-04-03', 249.50, 'NEW'), (203, '2026-04-03', 79.90, 'NEW');
-- Update existing records
UPDATE public.products SET stock_quantity = 30, price = 549.99 WHERE product_name = 'Ergonomic Chair';
UPDATE public.orders SET status = 'DELIVERED' WHERE order_id = 201;
-- Delete a record
DELETE FROM public.products WHERE product_name = 'Test Widget';&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Wait for the changes to propagate through the pipeline, then query Athena again. The following figures show the results after the insert, update, and delete operations have been applied.&lt;/p&gt; 
&lt;p&gt;In the products table, the Test Widget record (&lt;code&gt;product_id&lt;/code&gt; 100) is no longer present because it was removed by the delete operation. The Ergonomic Chair row now reflects the updated price (549.99) and stock quantity (30). Two new records, Bluetooth Speaker and Standing Desk, appear with a later &lt;code&gt;created_at&lt;/code&gt; timestamp, confirming they were inserted after the initial snapshot.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/05/BDB-5821-7.png" alt="Figure 7. Products table after CDC operations. The Ergonomic Chair, Headphones, and Desk Lamp rows reflect updated values. Bluetooth Speaker and Standing Desk are newly inserted records. The Test Widget record has been removed by the delete operation." width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 7. Products table after CDC operations. The Ergonomic Chair, Headphones, and Desk Lamp rows reflect updated values. Bluetooth Speaker and Standing Desk are newly inserted records. The Test Widget record has been removed by the delete operation.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;In the orders table, order 100 now shows a status of &lt;code&gt;SHIPPED&lt;/code&gt; and order 201 shows &lt;code&gt;DELIVERED&lt;/code&gt;, reflecting the update operations. Three new orders (301, 302, 303) appear with status &lt;code&gt;NEW&lt;/code&gt; and a later timestamp, confirming they were inserted after the initial load.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/05/BDB-5821-8.png" alt="Figure 8. Orders table after CDC operations. Orders 100 and 201 reflect updated status values. Orders 301, 302, and 303 are newly inserted records." width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 8. Orders table after CDC operations. Orders 100 and 201 reflect updated status values. Orders 301, 302, and 303 are newly inserted records.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;This confirms that the pipeline correctly handles the three CDC operation types: inserts, updates, and deletes are captured from the Aurora WAL by Debezium, routed through the single MSK topic, transformed by the Lambda function, and applied as row-level Iceberg operations by Firehose.&lt;/p&gt; 
&lt;p&gt;S3 Tables performs &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-maintenance.html"&gt;table maintenance&lt;/a&gt; for Iceberg tables automatically, including compaction of small data files and expiration of old snapshots. You don’t need to run manual maintenance operations.&lt;/p&gt; 
&lt;p&gt;You can also use Iceberg’s &lt;a href="https://docs.aws.amazon.com/athena/latest/ug/querying-iceberg-table-data.html" target="_blank" rel="noopener"&gt;time travel&lt;/a&gt; capability to query the table as it existed before the updates:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;SELECT * FROM "s3tablescatalog/&amp;lt;table-bucket-name&amp;gt;"."aurora_cdc"."orders"
FOR TIMESTAMP AS OF current_timestamp - interval '5' minute;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
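&lt;p&gt;This returns the table as it existed five minutes earlier. If you applied the updates within that window, the results show the original data from before the update, demonstrating the time travel capability that Apache Iceberg provides through S3 Tables. Adjust the interval to match when you ran the statements.&lt;/p&gt; 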
&lt;h2 id="cleaning-up"&gt;Cleaning up&lt;/h2&gt; 
&lt;p&gt;To avoid ongoing charges, delete the resources in reverse dependency order.&lt;/p&gt; 
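&lt;p&gt;The Debezium connector from Step 6 was created through the AWS CLI rather than by a CDK stack, so the remaining steps in this section don’t appear to remove it. Because the connector uses the MSK security group, the service execution role, and the custom plugin, consider deleting it first. The following is a minimal sketch; the function name &lt;code&gt;delete_connector_by_name&lt;/code&gt; is hypothetical, and it assumes that only one connector matches the name prefix.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# Hypothetical helper: delete an MSK Connect connector by name.
# Arguments: connector name, Region. Returns 1 if no connector is found.
delete_connector_by_name() {
    local name="$1" region="$2" arn
    # Look up the ARN of the first connector whose name starts with the given value
    arn=$(aws kafkaconnect list-connectors \
        --region "$region" \
        --connector-name-prefix "$name" \
        --query 'connectors[0].connectorArn' --output text)
    if [ -z "$arn" ] || [ "$arn" = "None" ]; then
        echo "No connector found with prefix $name"
        return 1
    fi
    aws kafkaconnect delete-connector \
        --connector-arn "$arn" \
        --region "$region"
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;For example, run &lt;code&gt;delete_connector_by_name aurora-postgres-debezium-connector &amp;lt;your-region&amp;gt;&lt;/code&gt; and wait for the connector to finish deleting before you continue with the remaining steps.&lt;/p&gt; 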
&lt;p&gt;Delete the CDK stacks:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;cd cdk
npx cdk --app "npx ts-node bin/app-v2.ts" destroy --all&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Delete the Debezium custom plugin and worker configuration that were created through the AWS CLI in Step 2:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws kafkaconnect delete-custom-plugin --custom-plugin-arn &amp;lt;plugin-arn&amp;gt;
aws kafkaconnect delete-worker-configuration --worker-configuration-arn &amp;lt;worker-config-arn&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Clean up the Aurora PostgreSQL replication resources:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;SELECT pg_drop_replication_slot('debezium_slot');
DROP PUBLICATION dbz_publication;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; The replication slot (&lt;code&gt;debezium_slot&lt;/code&gt;) was created automatically by Debezium. If you plan to redeploy the pipeline later, you don’t need to drop the slot and publication. However, the replication slot continues to retain WAL segments while the connector isn’t running, which can increase storage usage on the Aurora cluster. The MSK cluster is the largest cost component of this solution and can’t be paused. It can only be deleted and recreated.&lt;/p&gt; 
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, we showed you how to build a near real-time CDC pipeline from Aurora PostgreSQL to Apache Iceberg tables in Amazon S3 Tables. The key architectural decisions include:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Single-topic routing with multi-table delivery.&lt;/strong&gt; The Debezium &lt;code&gt;ByLogicalTableRouter&lt;/code&gt; SMT routes CDC events from multiple tables through one MSK topic, and the Lambda &lt;code&gt;otfMetadata&lt;/code&gt; routing directs each record to the correct Iceberg table. This reduces VPC connection costs by using a single Firehose stream for inserts, updates, and deletes across multiple destination tables.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Fully managed CDC pipeline.&lt;/strong&gt; MSK Connect runs Debezium, Firehose handles delivery with automatic retries, and S3 Tables manages Iceberg compaction and snapshots. The Lambda transform preserves CDC semantics by mapping Debezium operations to Iceberg row-level operations.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Governed lakehouse access.&lt;/strong&gt; Lake Formation controls fine-grained access to the Iceberg tables, and data from multiple isolated Aurora clusters can be unified in a single S3 Tables namespace for cross-domain analytics.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Infrastructure as code.&lt;/strong&gt; Six AWS CDK stacks deploy the core pipeline, with Lake Formation permissions, MSK cluster policy, and Debezium connector configured through documented CLI steps.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;To get started, clone the &lt;a href="https://github.com/aws-samples/sample-aurora-cdc-s3tables" target="_blank" rel="noopener"&gt;sample repository&lt;/a&gt; and follow the walkthrough steps. For more information about the services used in this solution, see the &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/what-is-msk.html" target="_blank" rel="noopener"&gt;Amazon MSK Developer Guide&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/firehose/latest/dev/what-is-this-service.html" target="_blank" rel="noopener"&gt;Amazon Data Firehose Developer Guide&lt;/a&gt;, and &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables.html" target="_blank" rel="noopener"&gt;Amazon S3 Tables User Guide&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;We encourage you to try this solution and adapt it to your own CDC workloads. If you have questions or feedback, leave a comment on this post.&lt;/p&gt; 
&lt;h2 id="related-posts"&gt;Related posts&lt;/h2&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/storage/build-a-data-lake-for-streaming-data-with-amazon-s3-tables-and-amazon-data-firehose/" target="_blank" rel="noopener"&gt;Build a data lake for streaming data with Amazon S3 Tables and Amazon Data Firehose&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/stream-data-from-amazon-msk-to-apache-iceberg-tables-in-amazon-s3-and-amazon-s3-tables-using-amazon-data-firehose/"&gt;Stream CDC into an Amazon S3 data lake in Apache Iceberg format with AWS Glue Streaming and Amazon MSK Connect&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/aws/introducing-amazon-msk-connect-stream-data-to-and-from-your-apache-kafka-clusters-using-managed-connectors/" target="_blank" rel="noopener"&gt;Introducing Amazon MSK Connect – Stream Data to and from Your Apache Kafka Clusters Using Managed Connectors&lt;/a&gt;&lt;/li&gt; 
&lt;/ul&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the author&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/05/BDB-5821-9.jpeg" alt="Chintan Agrawal" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Chintan Agrawal&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/chintan-agrawal-87a866135/" target="_blank" rel="noopener"&gt;Chintan&lt;/a&gt; is a Solutions Architect with over 7 years of experience, specializing in the analytics and healthcare domains. He is enthusiastic about helping customers discover valuable insights from their data, and he builds innovative solutions that help businesses make informed, data-driven decisions.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>Beyond JSON blobs: Implementing the VARIANT data type in Apache Iceberg V3</title>
		<link>https://aws.amazon.com/blogs/big-data/beyond-json-blobs-implementing-the-variant-data-type-in-apache-iceberg-v3/</link>
		
		<dc:creator><![CDATA[Arun Shanmugam]]></dc:creator>
		<pubDate>Tue, 09 Jun 2026 17:01:16 +0000</pubDate>
				<category><![CDATA[Analytics]]></category>
		<category><![CDATA[AWS Glue]]></category>
		<guid isPermaLink="false">3ae5d302e7a4c5dea817dbd2b9019ad81e1897b3</guid>

					<description>This post is part 1 of a two-part series. We walk through the basics: creating an Iceberg V3 table with a VARIANT column, inserting semi-structured data, and querying it with variant_get(). In Part 2, we scale to millions of rows and benchmark VARIANT against traditional string storage. We measure the difference in query performance and storage footprint.</description>
										<content:encoded>&lt;p&gt;&lt;a href="https://iceberg.apache.org/" target="_blank" rel="noopener"&gt;Apache Iceberg&lt;/a&gt; V3 introduces the &lt;a href="https://iceberg.apache.org/spec/" target="_blank" rel="noopener"&gt;VARIANT data type&lt;/a&gt;. VARIANT provides data engineers with a high-performance, native solution for managing semi-structured data within the data lake. Consider a massive fleet of IoT sensors: street-level temperature probes, air quality monitors, and vehicle telemetry. Each device emits data in unique JSON structures that constantly evolve with firmware updates.&lt;/p&gt; 
&lt;p&gt;Historically, engineers were forced to store these payloads as STRING blobs. This legacy approach mandates expensive CPU-intensive parsing at runtime and inflates storage costs with redundant raw text. VARIANT solves these inefficiencies by employing a shredded, binary-encoded format. This allows query engines to skip irrelevant data and access specific nested fields with columnar speed, effectively bridging the gap between the flexibility of JSON and the performance of a structured schema.&lt;/p&gt; 
&lt;p&gt;VARIANT is stored in Parquet as a three-part group: binary metadata (type and dictionary info), a binary value (the full variant for fallback), and a &lt;code&gt;typed_value&lt;/code&gt; group where individual JSON fields are shredded into separate Parquet columns. When you query a specific field, Spark prunes the &lt;code&gt;typed_value&lt;/code&gt; group to include only the requested sub-columns. It always retains metadata and the value fallback, so it avoids reading the entire document. This approach delivers two concrete benefits:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Reduced query processing time&lt;/strong&gt;: Queries access only the fields they need without deserializing entire JSON documents. This reduces the amount of data scanned and the time spent on deserialization.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Lower storage footprint&lt;/strong&gt;: Binary encoding compresses more efficiently than raw text, reducing storage costs.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Fields inside the JSON become individually accessible columns under the hood. A query that needs one value out of a deeply nested document no longer must read and deserialize the entire thing. You maintain schema flexibility while gaining the performance characteristics of structured columnar storage.&lt;/p&gt; 
&lt;p&gt;This post is part 1 of a two-part series. We walk through the basics: creating an Iceberg V3 table with a VARIANT column, inserting semi-structured data, and querying it with &lt;code&gt;variant_get()&lt;/code&gt;. In Part 2, we scale to millions of rows and benchmark VARIANT against traditional string storage. We measure the difference in query performance and storage footprint.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;This walkthrough demonstrates an end-to-end workflow for working with semi-structured data using the VARIANT data type in Apache Iceberg V3 on Amazon EMR Serverless. Raw JSON payloads are ingested and converted to binary VARIANT format using &lt;code&gt;parse_json()&lt;/code&gt;. The data is stored in an Iceberg V3 table where the engine shreds the structure into columnar Parquet sub-columns. You can then query the data efficiently using &lt;code&gt;variant_get()&lt;/code&gt; to extract specific fields without deserializing the entire document. &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/components-overview.html#data-catalog-intro" target="_blank" rel="noopener"&gt;AWS Glue Data Catalog&lt;/a&gt; manages the table metadata. Amazon Simple Storage Service (Amazon S3) provides the underlying storage.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: Check the Apache Iceberg documentation for the latest information on specification status and engine compatibility. Additionally, Fine-Grained Access Control (FGAC) through AWS Lake Formation is not currently supported for the VARIANT data type.&lt;/p&gt; 
&lt;h2&gt;How VARIANT works&lt;/h2&gt; 
&lt;p&gt;When you insert a JSON document into a VARIANT column, Spark converts it from a JSON string into the Variant binary format. During writes, the engine can shred the structure. It extracts individual fields and stores them as native Parquet-typed sub-columns within the VARIANT column’s &lt;code&gt;typed_value&lt;/code&gt; group. Fields that are not shredded remain in the binary value column as a fallback. This is conceptually similar to how a columnar table stores each column independently. The difference is that the sub-columns live within a single VARIANT column, and the engine handles the shredding schema automatically.&lt;/p&gt; 
&lt;p&gt;At query time, when you ask for a specific field using &lt;code&gt;variant_get()&lt;/code&gt;, Spark reads only the sub-column that contains that field. It does not need to load or parse the rest of the document. For workloads that repeatedly query a handful of fields out of large, complex JSON payloads, this can significantly reduce both the amount of data scanned and the time spent deserializing it.&lt;/p&gt; 
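&lt;p&gt;The write and read behavior described above can be illustrated with a small pure-Python sketch. It is conceptual only: the real shredding is performed by the engine in Parquet, and the function names &lt;code&gt;shred&lt;/code&gt; and &lt;code&gt;read_field&lt;/code&gt;, as well as the dotted-path keys, are hypothetical names chosen for this illustration.&lt;/p&gt; 

```python
import json


def shred(document, shredded_paths):
    """Split a parsed JSON document into shredded fields and a residual.

    Conceptual illustration only; this is not the Parquet encoding.
    """
    typed_value = {}
    # Deep copy so the caller's document is left untouched.
    residual = json.loads(json.dumps(document))
    for path in shredded_paths:
        keys = path.split(".")
        node = residual
        # Walk down to the parent object of the field to shred.
        for key in keys[:-1]:
            node = node.get(key) if isinstance(node, dict) else None
            if node is None:
                break
        if isinstance(node, dict) and keys[-1] in node:
            # Move the field out of the residual into its own "sub-column".
            typed_value[path] = node.pop(keys[-1])
    return typed_value, residual


def read_field(typed_value, residual, path):
    """Read a field, preferring the shredded copy over the residual."""
    if path in typed_value:
        return typed_value[path]
    node = residual
    for key in path.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


event = {
    "sensors": {"temperature": 22.5, "humidity": 61.3},
    "network": {"connection": "WiFi"},
}
typed, rest = shred(event, ["sensors.temperature"])
```

&lt;p&gt;In this sketch, a lookup of &lt;code&gt;sensors.temperature&lt;/code&gt; is answered from the shredded dictionary alone, while &lt;code&gt;network.connection&lt;/code&gt; falls back to the residual document. This mirrors, in a simplified way, how unshredded fields remain available through the fallback value.&lt;/p&gt; 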
&lt;p&gt;The &lt;code&gt;variant_get()&lt;/code&gt; function uses JSON path syntax to navigate the structure. You can extract scalar values, optionally cast to an explicit type, access nested objects, and reach into arrays by index. The function signature is as follows.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;variant_get(column, '$.path.to.field', 'type')&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Here, &lt;code&gt;column&lt;/code&gt; is the VARIANT column name, the second argument is a JSON path expression, and the optional third argument specifies the expected return type (such as &lt;code&gt;'string'&lt;/code&gt;, &lt;code&gt;'int'&lt;/code&gt;, or &lt;code&gt;'double'&lt;/code&gt;). When the type argument is omitted, the function returns a VARIANT value that preserves the original encoding.&lt;/p&gt; 
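&lt;p&gt;To build intuition for how a path such as &lt;code&gt;$.alerts[0].severity&lt;/code&gt; is resolved, the following pure-Python sketch walks a parsed JSON document one path segment at a time. It is an illustration under simplifying assumptions, not Spark's implementation: &lt;code&gt;resolve_path&lt;/code&gt; is a hypothetical helper, it supports only dotted keys and bracketed array indexes, and it performs no type casting.&lt;/p&gt; 

```python
import json
import re

# One path segment: either ".key" or "[index]".
_SEGMENT = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)|\[(\d+)\]")


def resolve_path(document, path):
    """Resolve a simplified JSON path; return None when the path is absent."""
    if not path.startswith("$"):
        raise ValueError("path must start with '$'")
    node = document
    pos = 1
    while pos != len(path):
        match = _SEGMENT.match(path, pos)
        if match is None:
            raise ValueError(f"unsupported path segment at offset {pos}")
        key, index = match.group(1), match.group(2)
        if key is not None:
            # Object member access.
            if not isinstance(node, dict) or key not in node:
                return None
            node = node[key]
        else:
            # Zero-based array element access.
            i = int(index)
            if not isinstance(node, list) or i not in range(len(node)):
                return None
            node = node[i]
        pos = match.end()
    return node


payload = json.loads(
    '{"sensors": {"temperature": 22.5}, "alerts": [{"severity": "low"}]}'
)
temperature = resolve_path(payload, "$.sensors.temperature")
first_severity = resolve_path(payload, "$.alerts[0].severity")
second_severity = resolve_path(payload, "$.alerts[1].severity")
```

&lt;p&gt;The sketch returns &lt;code&gt;None&lt;/code&gt; for a path that does not exist, which corresponds to &lt;code&gt;variant_get()&lt;/code&gt; returning NULL for a missing path.&lt;/p&gt; 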
&lt;h2&gt;Running Iceberg V3 on Amazon EMR Serverless&lt;/h2&gt; 
&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html" target="_blank" rel="noopener"&gt;Amazon EMR Serverless&lt;/a&gt; 8.0 ships with Apache Spark 4.0.1, which includes native support for Iceberg V3 and the VARIANT data type. You do not need to install additional libraries or configure custom JARs. Amazon EMR Serverless manages the compute infrastructure and scales resources up and down based on workload demand. You can focus on the data rather than the cluster.&lt;/p&gt; 
&lt;p&gt;While this post uses Amazon EMR Serverless, Iceberg V3 VARIANT support is also available on Amazon EMR on EC2 and Amazon EMR on EKS. You can choose the deployment model that fits your environment.&lt;/p&gt; 
&lt;h2&gt;Getting started&lt;/h2&gt; 
&lt;p&gt;The following walkthrough creates an Iceberg V3 table with a VARIANT column, inserts a set of IoT sensor events, and runs queries to extract fields from the semi-structured payload. Each step includes the code you need to run it on Amazon EMR Serverless.&lt;/p&gt; 
&lt;h3&gt;Prerequisites&lt;/h3&gt; 
&lt;p&gt;Before you begin, verify you have the following:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;An AWS account with permissions to create Amazon EMR Serverless applications and access Amazon Simple Storage Service (Amazon S3).&lt;/li&gt; 
 &lt;li&gt;An Amazon S3 bucket for storing Iceberg table data and scripts.&lt;/li&gt; 
 &lt;li&gt;AWS Glue Data Catalog configured for metadata management.&lt;/li&gt; 
 &lt;li&gt;An IAM execution role with permissions for Amazon EMR Serverless, Amazon S3, AWS Glue, and Amazon CloudWatch Logs.&lt;/li&gt; 
 &lt;li&gt;AWS Command Line Interface (AWS CLI) installed and configured.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Running this solution in your AWS account might incur charges for Amazon EMR Serverless, Amazon S3, and AWS Glue. Refer to the respective pricing pages for cost details.&lt;/p&gt; 
&lt;h3&gt;Step 1: Initialize a Spark session with Iceberg V3&lt;/h3&gt; 
&lt;p&gt;Start by creating a Spark session configured to use the Iceberg catalog backed by AWS Glue. The key settings are the Iceberg Spark extensions and the AWS Glue catalog implementation. Replace &lt;code&gt;&amp;lt;YOUR_S3_BUCKET&amp;gt;&lt;/code&gt; with your bucket name.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;from pyspark.sql import SparkSession
from pyspark.sql.functions import col, parse_json

spark = SparkSession.builder \
    .appName("IcebergV3VariantDemo") \
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config("spark.sql.catalog.glue_catalog",
            "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.glue_catalog.warehouse",
            "s3://&amp;lt;YOUR_S3_BUCKET&amp;gt;/warehouse/") \
    .config("spark.sql.catalog.glue_catalog.catalog-impl",
            "org.apache.iceberg.aws.glue.GlueCatalog") \
    .config("spark.sql.catalog.glue_catalog.io-impl",
            "org.apache.iceberg.aws.s3.S3FileIO") \
    .getOrCreate()&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;When running on Amazon EMR Serverless, some Spark configurations might be set at the application or job level. The configuration shown here is included in the script for completeness. Depending on your Amazon EMR Serverless application settings, you might not need to specify all these properties in the script.&lt;/p&gt; 
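&lt;p&gt;If you prefer to supply these properties at the job level, one option is the &lt;code&gt;sparkSubmitParameters&lt;/code&gt; field of the &lt;code&gt;sparkSubmit&lt;/code&gt; job driver, which accepts a string of &lt;code&gt;--conf key=value&lt;/code&gt; pairs. The following sketch builds that string from the same properties used in the script. The helper names &lt;code&gt;iceberg_glue_conf&lt;/code&gt; and &lt;code&gt;to_spark_submit_parameters&lt;/code&gt; and the bucket name &lt;code&gt;example-bucket&lt;/code&gt; are assumptions made for this example.&lt;/p&gt; 

```python
def iceberg_glue_conf(bucket):
    """Return the Iceberg and AWS Glue catalog properties used in this post."""
    prefix = "spark.sql.catalog.glue_catalog"
    return {
        "spark.sql.extensions":
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        prefix + ".warehouse": f"s3://{bucket}/warehouse/",
        prefix + ".catalog-impl": "org.apache.iceberg.aws.glue.GlueCatalog",
        prefix + ".io-impl": "org.apache.iceberg.aws.s3.S3FileIO",
    }


def to_spark_submit_parameters(conf):
    """Render a dict of Spark properties as a '--conf key=value' string."""
    parts = []
    for key, value in conf.items():
        # Keep the sketch simple: reject values that would need quoting.
        if any(ch.isspace() for ch in key + value):
            raise ValueError(f"whitespace is not supported in {key!r}")
        parts.append(f"--conf {key}={value}")
    return " ".join(parts)


# "example-bucket" is a placeholder bucket name.
params = to_spark_submit_parameters(iceberg_glue_conf("example-bucket"))
```

&lt;p&gt;Properties set this way do not need to be repeated in the &lt;code&gt;SparkSession.builder&lt;/code&gt; call, although keeping them in both places is harmless when the values agree.&lt;/p&gt; 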
&lt;h3&gt;Step 2: Create an Iceberg V3 table with a VARIANT column&lt;/h3&gt; 
&lt;p&gt;Create a namespace and table. &lt;strong&gt;The format version must be set to 3 for VARIANT data type support&lt;/strong&gt;. The following table models IoT sensor events with a few standard columns and a VARIANT column for the semi-structured payload.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;spark.sql("CREATE NAMESPACE IF NOT EXISTS glue_catalog.iceberg_v3_demo")

spark.sql("""
CREATE TABLE IF NOT EXISTS glue_catalog.iceberg_v3_demo.sensor_events (
    event_id STRING,
    device_id STRING,
    event_timestamp TIMESTAMP,
    event_data VARIANT
)
USING iceberg
TBLPROPERTIES (
    'format-version' = '3'
)
""")&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The &lt;code&gt;event_data&lt;/code&gt; column is declared as VARIANT. Iceberg stores it in Parquet as a binary-encoded VARIANT structure (metadata, value, and optional shredded sub-columns) rather than as a plain text string.&lt;/p&gt; 
&lt;h3&gt;Step 3: Insert semi-structured data&lt;/h3&gt; 
&lt;p&gt;To insert JSON data into a VARIANT column, use the &lt;code&gt;parse_json()&lt;/code&gt; function. This converts a JSON &lt;em&gt;string into the binary VARIANT format&lt;/em&gt; at write time. The following example creates a small DataFrame of IoT events and appends them to the table.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;import json
from pyspark.sql.functions import current_timestamp
from pyspark.sql.types import StructType, StructField, StringType

# Sample IoT events with nested JSON payloads
events = [
    ("evt_001", "sensor_001", json.dumps({
        "device": {"manufacturer": "SensorTech", "model": "ST-200",
                   "firmware_version": "3.1.4"},
        "sensors": {"temperature": 22.5, "humidity": 61.3,
                    "air_quality": {"pm25": 12.4, "co2": 415}},
        "network": {"connection": "WiFi", "latency_ms": 42},
        "alerts": [{"severity": "low", "message": "Calibration due"}]
    })),
    ("evt_002", "sensor_002", json.dumps({
        "device": {"manufacturer": "IoTCorp", "model": "IC-500",
                   "firmware_version": "2.8.1"},
        "sensors": {"temperature": 34.1, "humidity": 78.9,
                    "air_quality": {"pm25": 142.7, "co2": 1850}},
        "network": {"connection": "LTE", "latency_ms": 210},
        "alerts": [{"severity": "critical",
                    "message": "Temperature threshold exceeded"},
                   {"severity": "high",
                    "message": "Poor air quality detected"}]
    })),
    ("evt_003", "sensor_003", json.dumps({
        "device": {"manufacturer": "SmartDevices", "model": "SD-100",
                   "firmware_version": "1.5.9"},
        "sensors": {"temperature": 18.7, "humidity": 45.2,
                    "air_quality": {"pm25": 8.1, "co2": 390}},
        "network": {"connection": "Ethernet", "latency_ms": 5},
        "alerts": []
    })),
]

schema = StructType([
    StructField("event_id", StringType(), False),
    StructField("device_id", StringType(), False),
    StructField("event_data", StringType(), False),
])

df = spark.createDataFrame(events, schema)
df = df.withColumn("event_timestamp", current_timestamp())

# Convert JSON string to VARIANT using parse_json
df = df.withColumn("event_data", parse_json(col("event_data")))

df.writeTo("glue_catalog.iceberg_v3_demo.sensor_events").append()
print("Data inserted successfully.")&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The &lt;code&gt;parse_json()&lt;/code&gt; call is the key step. It takes the raw JSON string and encodes it into &lt;em&gt;the binary VARIANT&lt;/em&gt; format before writing to the Iceberg table.&lt;/p&gt; 
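&lt;p&gt;Because &lt;code&gt;parse_json()&lt;/code&gt; expects well-formed JSON, a malformed payload can fail the write. The following sketch shows one way to separate bad records before creating the DataFrame. It assumes rows shaped like the tuples in the preceding example, with the JSON string at index 2. &lt;code&gt;split_valid_json&lt;/code&gt; is a hypothetical helper, and Python's JSON parser is not guaranteed to accept exactly the same inputs as Spark's. Spark 4.0 also provides &lt;code&gt;try_parse_json()&lt;/code&gt;, which returns NULL for invalid input instead of raising an error.&lt;/p&gt; 

```python
import json


def split_valid_json(rows, payload_index=2):
    """Partition rows into (valid, rejected) by whether the payload parses."""
    valid, rejected = [], []
    for row in rows:
        try:
            json.loads(row[payload_index])
        except (TypeError, ValueError):
            # TypeError covers None payloads; ValueError covers malformed JSON.
            rejected.append(row)
        else:
            valid.append(row)
    return valid, rejected


sample_rows = [
    ("evt_101", "sensor_101", '{"sensors": {"temperature": 21.0}}'),
    ("evt_102", "sensor_102", '{"sensors": {"temperature": '),  # truncated
    ("evt_103", "sensor_103", None),
]
good_rows, bad_rows = split_valid_json(sample_rows)
```

&lt;p&gt;Only &lt;code&gt;good_rows&lt;/code&gt; would then be passed to &lt;code&gt;spark.createDataFrame()&lt;/code&gt;, while &lt;code&gt;bad_rows&lt;/code&gt; can be logged or routed to a separate location for inspection.&lt;/p&gt; 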
&lt;h3&gt;Step 4: Query VARIANT data with variant_get()&lt;/h3&gt; 
&lt;p&gt;Once the data is in the table, you can extract individual fields from the VARIANT column using &lt;code&gt;variant_get()&lt;/code&gt;. The following queries demonstrate three common patterns: simple field extraction, deep nested access with filtering, and array element access.&lt;/p&gt; 
&lt;p&gt;The following queries are shown as raw SQL for readability. To run them in your PySpark script, wrap each query in a &lt;code&gt;spark.sql()&lt;/code&gt; call. For example: &lt;code&gt;spark.sql("SELECT ...").show()&lt;/code&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Query 1: Simple field extraction&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Extract sensor readings from the &lt;code&gt;sensors&lt;/code&gt; object in the payload.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;SELECT
    event_id,
    device_id,
    variant_get(event_data, '$.sensors.temperature', 'double') AS temperature,
    variant_get(event_data, '$.sensors.humidity', 'double') AS humidity
FROM glue_catalog.iceberg_v3_demo.sensor_events&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;When these fields are shredded, this query reads only the temperature and humidity sub-columns from the VARIANT data. It does not parse or load the rest of the JSON document.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Query 2: Deep nested access with filtering&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Reach into nested objects and filter on a value buried inside the structure.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;SELECT
    device_id,
    variant_get(event_data, '$.sensors.air_quality.pm25', 'double') AS pm25,
    variant_get(event_data, '$.sensors.air_quality.co2', 'int') AS co2_level,
    variant_get(event_data, '$.device.manufacturer', 'string') AS manufacturer
FROM glue_catalog.iceberg_v3_demo.sensor_events
WHERE variant_get(event_data, '$.sensors.air_quality.pm25', 'double') &amp;gt; 100.0&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The &lt;code&gt;WHERE&lt;/code&gt; clause filters directly on a nested VARIANT field. Spark evaluates the predicate against the shredded sub-column without deserializing the full payload.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Query 3: Array element access&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Access elements inside a JSON array stored within the VARIANT column.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;SELECT
    event_id,
    device_id,
    variant_get(event_data, '$.alerts[0].severity', 'string') AS first_alert_severity,
    variant_get(event_data, '$.alerts[0].message', 'string') AS first_alert_message
FROM glue_catalog.iceberg_v3_demo.sensor_events
WHERE variant_get(event_data, '$.alerts[0].severity', 'string') = 'critical'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
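&lt;p&gt;Array indexing uses standard bracket notation in the JSON path, with zero-based indexes. This query finds events where the first alert has critical severity and returns the alert details. Events with an empty &lt;code&gt;alerts&lt;/code&gt; array, such as &lt;code&gt;evt_003&lt;/code&gt;, are excluded because the path resolves to NULL.&lt;/p&gt; 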
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/14/BDB-5525-1.png" alt="Query results showing simple field extraction, nested access with filtering, and array element access from the VARIANT column" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 1: Query results showing simple field extraction, nested access with filtering, and array element access from the VARIANT column.&lt;/em&gt;&lt;/p&gt; 
&lt;h2&gt;Submitting the job to Amazon EMR Serverless&lt;/h2&gt; 
&lt;p&gt;To run this on Amazon EMR Serverless, save the preceding code as a single PySpark script (for example, &lt;code&gt;iceberg_v3_variant_demo.py&lt;/code&gt;), upload it to Amazon S3, and submit it as a job. Replace the placeholder values with your own.&lt;/p&gt; 
&lt;p&gt;Before submitting the job, make sure you have created an Amazon EMR Serverless application. For instructions, see &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/getting-started.html" target="_blank" rel="noopener"&gt;Getting started with Amazon EMR Serverless&lt;/a&gt; in the Amazon EMR documentation.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# Upload script to S3
aws s3 cp iceberg_v3_variant_demo.py \
    s3://&amp;lt;YOUR_S3_BUCKET&amp;gt;/scripts/ \
    --region &amp;lt;REGION&amp;gt;

# Submit the job
aws emr-serverless start-job-run \
    --application-id &amp;lt;APPLICATION_ID&amp;gt; \
    --execution-role-arn arn:aws:iam::&amp;lt;ACCOUNT_ID&amp;gt;:role/EMRServerlessExecutionRole \
    --job-driver '{
        "sparkSubmit": {
            "entryPoint": "s3://&amp;lt;YOUR_S3_BUCKET&amp;gt;/scripts/iceberg_v3_variant_demo.py"
        }
    }' \
    --configuration-overrides '{
        "monitoringConfiguration": {
            "cloudWatchLoggingConfiguration": {
                "enabled": true,
                "logGroupName": "/aws/emr-serverless/applications/&amp;lt;APPLICATION_ID&amp;gt;"
            }
        }
    }' \
    --region &amp;lt;REGION&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
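&lt;p&gt;Hand-writing nested JSON inside shell quotes is error-prone, so you might prefer to assemble the request in Python. The following sketch builds the same request as a dictionary. &lt;code&gt;build_job_run_request&lt;/code&gt; is a hypothetical helper, and the identifiers passed to it are placeholder values. The key names follow the request parameters of the &lt;code&gt;start_job_run&lt;/code&gt; operation, so the result can be passed as keyword arguments to a Boto3 &lt;code&gt;emr-serverless&lt;/code&gt; client, which this sketch does not call.&lt;/p&gt; 

```python
import json


def build_job_run_request(application_id, account_id, bucket, script_name,
                          role_name="EMRServerlessExecutionRole"):
    """Assemble the start-job-run request used in this walkthrough."""
    return {
        "applicationId": application_id,
        "executionRoleArn": f"arn:aws:iam::{account_id}:role/{role_name}",
        "jobDriver": {
            "sparkSubmit": {
                "entryPoint": f"s3://{bucket}/scripts/{script_name}",
            }
        },
        "configurationOverrides": {
            "monitoringConfiguration": {
                "cloudWatchLoggingConfiguration": {
                    "enabled": True,
                    "logGroupName":
                        f"/aws/emr-serverless/applications/{application_id}",
                }
            }
        },
    }


# All identifiers below are placeholders for illustration.
request = build_job_run_request(
    application_id="00example123",
    account_id="111122223333",
    bucket="example-bucket",
    script_name="iceberg_v3_variant_demo.py",
)
# JSON text suitable for the --job-driver option of the AWS CLI.
job_driver_json = json.dumps(request["jobDriver"])
```

&lt;p&gt;Serializing with &lt;code&gt;json.dumps()&lt;/code&gt; helps keep the payload valid JSON, including the lowercase &lt;code&gt;true&lt;/code&gt; that the logging configuration requires.&lt;/p&gt; 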
&lt;h2&gt;Use cases&lt;/h2&gt; 
&lt;p&gt;VARIANT fits naturally into workloads where the data is semi-structured and the schema is not fully known in advance. Some use cases include the following:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;IoT and sensor data&lt;/strong&gt;: Device fleets produce telemetry in varying JSON formats that evolve with firmware updates. VARIANT stores these payloads without requiring a fixed schema, and queries can extract specific readings without scanning the entire document.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Clickstream analytics&lt;/strong&gt;: User behavior events on websites and mobile apps carry different attributes depending on the action. Page views, clicks, form submissions, and purchases each have their own structure. VARIANT accommodates these differing event structures in a single column.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Log analytics&lt;/strong&gt;: Application logs, infrastructure metrics, and audit trails often arrive as unstructured or loosely structured JSON. VARIANT lets you ingest them as is and query specific fields on demand, without defining a schema up front.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Clean up&lt;/h2&gt; 
&lt;p&gt;To avoid ongoing charges, delete the resources you created:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Drop the Iceberg table and namespace using Spark SQL. 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="language-python"&gt;spark.sql("DROP TABLE IF EXISTS glue_catalog.iceberg_v3_demo.sensor_events")
spark.sql("DROP NAMESPACE IF EXISTS glue_catalog.iceberg_v3_demo")&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;/li&gt; 
 &lt;li&gt;Stop and delete the Amazon EMR Serverless application. 
  &lt;div class="hide-language"&gt; 
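   &lt;pre&gt;&lt;code class="language-bash"&gt;aws emr-serverless stop-application --application-id &amp;lt;APPLICATION_ID&amp;gt; --region &amp;lt;REGION&amp;gt;
aws emr-serverless delete-application --application-id &amp;lt;APPLICATION_ID&amp;gt; --region &amp;lt;REGION&amp;gt;&lt;/code&gt;&lt;/pre&gt; 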
  &lt;/div&gt; &lt;/li&gt; 
 &lt;li&gt;Delete the S3 objects used for table data and scripts. If you created the bucket only for this walkthrough, you can also delete the bucket after it is empty. 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="language-bash"&gt;aws s3 rm s3://&amp;lt;YOUR_S3_BUCKET&amp;gt;/warehouse/ --recursive
aws s3 rm s3://&amp;lt;YOUR_S3_BUCKET&amp;gt;/scripts/ --recursive&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;Apache Iceberg V3’s VARIANT type provides an efficient way to store and query semi-structured data in your data lake. Columnar storage and shredding reduce storage costs, and direct field access through &lt;code&gt;variant_get()&lt;/code&gt; removes the need to parse JSON strings at query time. On Amazon EMR Serverless, you get this capability without managing infrastructure.&lt;/p&gt; 
&lt;p&gt;In Part 2 of this series, we scale to millions of rows and benchmark VARIANT against traditional string storage. We measure query performance and storage footprint under realistic workloads.&lt;/p&gt; 
&lt;p&gt;To learn more about Apache Iceberg on AWS, see &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/introduction.html" target="_blank" rel="noopener"&gt;Apache Iceberg on AWS&lt;/a&gt; prescriptive guidance. For more information about Amazon EMR Serverless, see the &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html" target="_blank" rel="noopener"&gt;Amazon EMR Serverless documentation&lt;/a&gt;.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/14/BDB-5525-2.jpeg" alt="Arun Shanmugam" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Arun Shanmugam&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/arun-shanmugam/" target="_blank" rel="noopener"&gt;Arun&lt;/a&gt; is a Senior Analytics Solutions Architect at AWS, with a focus on building modern data architecture. He has been successfully delivering scalable data analytics solutions for customers across diverse industries. Outside of work, Arun is an avid outdoor enthusiast who actively engages in CrossFit, road biking, and cricket.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/14/BDB-5525-3.jpeg" alt="Suthan Phillips" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Suthan Phillips&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/suthan-phillips/" target="_blank" rel="noopener"&gt;Suthan&lt;/a&gt; is a Senior Analytics Architect at AWS, where he helps customers design and optimize scalable, high-performance data solutions that drive business insights. He combines architectural guidance on system design and scalability with best practices to provide efficient, secure implementation across data processing and experience layers. Outside of work, Suthan enjoys swimming, hiking, and exploring the Pacific Northwest.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/14/BDB-5525-4.jpeg" alt="Ron Ortloff" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Ron Ortloff&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/ron-ortloff/"&gt;Ron Ortloff&lt;/a&gt; is a Principal Product Manager at AWS, where he focuses on Apache Iceberg, S3 Tables, and open data lakehouse solutions. He has over 15 years of experience building and leading data platform initiatives, including launching Azure Synapse Analytics at Microsoft and leading Iceberg and data lake strategy at Snowflake. When he’s not building data platforms, Ron can be found cheering on his favorite football and hockey teams.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/14/BDB-5525-5.jpeg" alt="Xiaoxuan Li" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Xiaoxuan Li&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/lixiaoxuan/"&gt;Xiaoxuan&lt;/a&gt; is a Software Development Engineer at AWS, working on the performance and scalability of Apache Iceberg in large-scale data lakehouse systems. Her interests span query optimization, storage-efficient architectures, and distributed data processing. Outside of work, she explores AI systems for creative storytelling and tooling for writers and content creators.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>Upgrade PySpark from Spark 3.5 to Spark 4.0 with AWS Spark Upgrade Agent</title>
		<link>https://aws.amazon.com/blogs/big-data/upgrade-pyspark-from-spark-3-5-to-spark-4-0-with-aws-spark-upgrade-agent/</link>
		
		<dc:creator><![CDATA[Prasad Nadig]]></dc:creator>
		<pubDate>Tue, 09 Jun 2026 16:17:23 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon EMR]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">9970b234db66dd5f0b5f2159e09bf7270295af6c</guid>

					<description>In this post, we walk through a hands-on PySpark migration from Spark 3.5 to Spark 4.0 on Amazon EMR Serverless, using the AWS Spark Upgrade Agent. You’ll see how the agent iteratively validates your application on a live Amazon EMR Serverless application, automatically diagnosing and resolving failures from Amazon CloudWatch logs until the job succeeds.</description>
										<content:encoded>&lt;p&gt;Upgrading Apache Spark applications across major versions means tracking down breaking changes, manually debugging failures from log files, and running repeated test cycles. This process can stretch across weeks for complex code bases.&lt;/p&gt; 
&lt;p&gt;In this post, we walk through a hands-on PySpark migration from Spark 3.5 to Spark 4.0 on &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html" target="_blank" rel="noopener"&gt;Amazon EMR Serverless&lt;/a&gt;, using the &lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/spark-upgrades.html" target="_blank" rel="noopener"&gt;AWS Spark Upgrade Agent&lt;/a&gt;. You’ll see how the agent iteratively validates your application on a live Amazon EMR Serverless application, automatically diagnosing and resolving failures from &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/WhatIsCloudWatch.html" target="_blank" rel="noopener"&gt;Amazon CloudWatch&lt;/a&gt; logs until the job succeeds. By the end, you have a multi-pipeline PySpark application running on Spark 4.0 with four distinct breaking changes resolved. The fixes include configuration key removals, codec renames, and stricter charset validation, all driven through natural language interaction in the Integrated Development Environment (IDE).&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;This is part 2 of a three-part series on how the AWS Spark Upgrade Agent can automate and simplify Spark upgrades.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;In &lt;a href="https://aws.amazon.com/blogs/big-data/introducing-apache-spark-upgrade-agent-for-amazon-emr/" target="_blank" rel="noopener"&gt;Part 1&lt;/a&gt;, we introduced the agent’s architecture and capabilities. This post walks through a complete PySpark migration from Spark 3.5 to Spark 4.0 on Amazon EMR Serverless.&lt;/p&gt; 
&lt;p&gt;In the sections that follow, you will set up the prerequisites and infrastructure, explore the sample application, run the iterative validation workflow on EMR Serverless, review data quality results, and generate a comprehensive upgrade summary.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Because this upgrade is performed using the AWS Spark Upgrade Agent Model Context Protocol (MCP) server, an agentic artificial intelligence (AI) system, the agent might take different paths to reach the same successful outcome. The workflow demonstrated here represents one successful upgrade path. The key takeaway is the end-to-end workflow: generating an upgrade plan, iteratively validating on Amazon EMR Serverless, and producing a comprehensive upgrade summary.&lt;/p&gt; 
&lt;h2&gt;1. Prerequisites and setup&lt;/h2&gt; 
&lt;p&gt;This section covers the tools, infrastructure, and IDE configuration you need before starting the upgrade. To follow along, you need an AWS account with an AWS Identity and Access Management (AWS IAM) user or role that has permissions to deploy AWS CloudFormation stacks, create AWS IAM roles and policies, and create Amazon EMR Serverless applications. Intermediate knowledge of AWS Command Line Interface (AWS CLI), AWS CloudFormation, and Python is helpful.&lt;/p&gt; 
&lt;h3&gt;1.1 Install Kiro CLI and local tools&lt;/h3&gt; 
&lt;p&gt;In this post, we use &lt;a href="https://kiro.dev/docs/cli/" target="_blank" rel="noopener"&gt;&lt;strong&gt;Kiro CLI&lt;/strong&gt;&lt;/a&gt; to demonstrate the upgrade workflow. You can use any MCP-compatible IDE or framework, such as VS Code with Cline, Cursor, Windsurf, or Claude Desktop. To follow along with Kiro CLI, install it on your workstation. For more details on the installation and setup, refer to &lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-upgrade-agent-setup.html#spark-upgrade-agent-setup-resources" target="_blank" rel="noopener"&gt;Setup for Upgrade Agent&lt;/a&gt;:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;curl -fsSL https://cli.kiro.dev/install | bash&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Run the following command and use your builder ID to log in:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;kiro-cli login --use-device-flow&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;With Kiro CLI installed and logged in, you can have it set up and verify the remaining prerequisites instead of installing them manually. Use the following prompt:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;kiro-cli chat
&amp;gt; Install AWS CLI, Python 3.10, and uv on my system if they are not already installed&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/14/BDB-5889-1.jpg" alt="Kiro CLI output showing successful installation of AWS CLI, Python, and uv" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;Output of AWS CLI and local tools install step.&lt;/p&gt; 
&lt;p&gt;These tools are needed for the upgrade workflow:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html" target="_blank" rel="noopener"&gt;&lt;strong&gt;AWS CLI&lt;/strong&gt;&lt;/a&gt;: Configured with a profile that has permissions to assume the AWS Identity and Access Management (AWS IAM) role created in the following section.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://www.python.org/downloads/release/python-3100/" target="_blank" rel="noopener"&gt;&lt;strong&gt;Python 3.10+&lt;/strong&gt;&lt;/a&gt;: Required to match the EMR 8.0 runtime.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.astral.sh/uv/getting-started/installation/" target="_blank" rel="noopener"&gt;&lt;strong&gt;uv package manager&lt;/strong&gt;&lt;/a&gt;: Required by the &lt;a href="https://github.com/aws/mcp-proxy-for-aws" target="_blank" rel="noopener"&gt;MCP Proxy for AWS&lt;/a&gt;.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;1.2 Infrastructure setup (AWS CloudFormation)&lt;/h3&gt; 
&lt;p&gt;Two AWS CloudFormation stacks create the required resources: an AWS IAM role, an Amazon Simple Storage Service (Amazon S3) staging bucket, an Amazon EMR Serverless application (Spark 4.0.1), and its execution role.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Stack 1 – AWS IAM role and Amazon S3 staging bucket:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;The &lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-upgrade-agent-setup.html#spark-upgrade-agent-setup-resources" target="_blank" rel="noopener"&gt;spark-upgrade-mcp-setup&lt;/a&gt; template creates the AWS IAM role and Amazon S3 staging bucket required by the upgrade agent. For additional Regions, see the &lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-upgrade-agent-setup.html#spark-upgrade-agent-setup-resources" target="_blank" rel="noopener"&gt;full region list&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;After deployment, open the AWS CloudFormation &lt;strong&gt;Outputs&lt;/strong&gt; tab, copy the &lt;code&gt;ExportCommand&lt;/code&gt; value, and run it in your terminal. This sets &lt;code&gt;SMUS_MCP_REGION&lt;/code&gt;, &lt;code&gt;IAM_ROLE&lt;/code&gt;, and &lt;code&gt;STAGING_BUCKET_PATH&lt;/code&gt; automatically.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/14/BDB-5889-2.jpg" alt="CloudFormation Outputs tab showing ExportCommand with SMUS_MCP_REGION, IAM_ROLE, and STAGING_BUCKET_PATH values" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;Outputs tab of the CloudFormation stack.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# Sets SMUS_MCP_REGION, IAM_ROLE, and STAGING_BUCKET_PATH
export SMUS_MCP_REGION=&amp;lt;YOUR-REGION&amp;gt; &amp;amp;&amp;amp; export IAM_ROLE=arn:aws:iam::&amp;lt;YOUR-ACCOUNT-ID&amp;gt;:role/spark-upgrade-role-* &amp;amp;&amp;amp; export STAGING_BUCKET_PATH=&amp;lt;YOUR-BUCKET&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Then configure the AWS CLI profile:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws configure set profile.spark-upgrade-profile.role_arn ${IAM_ROLE}
aws configure set profile.spark-upgrade-profile.source_profile default
aws configure set profile.spark-upgrade-profile.region ${SMUS_MCP_REGION}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
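&lt;p&gt;For reference, these three commands write a named profile section to the AWS CLI config file (typically &lt;code&gt;~/.aws/config&lt;/code&gt;). The following Python sketch renders an equivalent section in memory so you can see roughly what to expect. The role ARN and Region are placeholder values, the helper name &lt;code&gt;render_profile&lt;/code&gt; is hypothetical, and the script doesn't read or modify your real configuration.&lt;/p&gt;

```python
import configparser
import io


def render_profile(role_arn, region, source_profile="default"):
    """Return config-file text for the spark-upgrade-profile section."""
    config = configparser.ConfigParser()
    # Named profiles live under "[profile NAME]" in the AWS CLI config file.
    config["profile spark-upgrade-profile"] = {
        "role_arn": role_arn,
        "source_profile": source_profile,
        "region": region,
    }
    buffer = io.StringIO()
    config.write(buffer)
    return buffer.getvalue()


# Placeholder values for illustration only.
profile_text = render_profile(
    "arn:aws:iam::111122223333:role/spark-upgrade-role-example", "us-east-1"
)
print(profile_text)
```

&lt;p&gt;If the section in your own config file is missing one of the three keys, rerun the corresponding &lt;code&gt;aws configure set&lt;/code&gt; command.&lt;/p&gt;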
&lt;p&gt;&lt;strong&gt;Stack 2 – Amazon EMR Serverless target application and execution role:&lt;/strong&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;git clone https://github.com/aws-samples/sample-amazon-emr-spark4-examples
cd sample-amazon-emr-spark4-examples/pyspark/AWSSpark4AutoUpgradeDemo&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The PySpark sample lives at &lt;code&gt;resources/global_logistics_platform/&lt;/code&gt;. The AWS CloudFormation template lives at &lt;code&gt;resources/cloudformation/&lt;/code&gt;.&lt;/p&gt; 
&lt;p&gt;Deploy the AWS CloudFormation template to create the source and target Amazon EMR Serverless applications and a shared execution role:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws cloudformation deploy \
  --template-file resources/cloudformation/emr-serverless-target-setup.yaml \
  --stack-name spark-emr-serverless-upgrade \
  --region ${SMUS_MCP_REGION} \
  --capabilities CAPABILITY_NAMED_IAM \
  --parameter-overrides \
    StagingBucketName=${STAGING_BUCKET_PATH} \
    SourceReleaseLabel=emr-7.0.0 \
    TargetReleaseLabel=emr-spark-8.0-preview \
    SourceApplicationName=spark-upgrade-source \
    TargetApplicationName=spark-upgrade-target&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;This creates two Amazon EMR Serverless applications: a source (Spark 3.5.0) for the data quality baseline and a target (Spark 4.0.1) for upgrade validation, with a shared execution role. Both applications auto-stop after 15 minutes of idle time, so there is no cost when not in use. To upgrade between a different pair of Spark versions, override &lt;code&gt;SourceReleaseLabel&lt;/code&gt; and &lt;code&gt;TargetReleaseLabel&lt;/code&gt; with the Amazon EMR release labels of your source and target versions.&lt;/p&gt; 
&lt;p&gt;After the stack completes deployment, note the outputs:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws cloudformation describe-stacks \
  --stack-name spark-emr-serverless-upgrade \
  --region ${SMUS_MCP_REGION} \
  --query "Stacks[0].Outputs" --output table&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;This gives you the &lt;code&gt;SourceApplicationId&lt;/code&gt;, &lt;code&gt;TargetApplicationId&lt;/code&gt;, and &lt;code&gt;ExecutionRoleArn&lt;/code&gt; values that you need for the upgrade prompt in Section 2.3.&lt;/p&gt; 
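&lt;p&gt;If you prefer to capture these values programmatically, you can run &lt;code&gt;describe-stacks&lt;/code&gt; with &lt;code&gt;--output json&lt;/code&gt; (and without &lt;code&gt;--query&lt;/code&gt;) and parse the response. The following is a minimal sketch, assuming the standard response shape of a &lt;code&gt;Stacks&lt;/code&gt; list whose entries carry &lt;code&gt;OutputKey&lt;/code&gt; and &lt;code&gt;OutputValue&lt;/code&gt; pairs. The sample response and the helper name &lt;code&gt;outputs_to_dict&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import json


def outputs_to_dict(describe_stacks_json):
    """Map each OutputKey to its OutputValue for the first stack in the response."""
    response = json.loads(describe_stacks_json)
    outputs = response["Stacks"][0].get("Outputs", [])
    return {item["OutputKey"]: item["OutputValue"] for item in outputs}


# Hypothetical response body; real IDs and ARNs will differ.
sample = json.dumps({
    "Stacks": [{
        "StackName": "spark-emr-serverless-upgrade",
        "Outputs": [
            {"OutputKey": "SourceApplicationId", "OutputValue": "source-app-id-example"},
            {"OutputKey": "TargetApplicationId", "OutputValue": "target-app-id-example"},
            {"OutputKey": "ExecutionRoleArn",
             "OutputValue": "arn:aws:iam::111122223333:role/example-execution-role"},
        ],
    }]
})

values = outputs_to_dict(sample)
for key in ("SourceApplicationId", "TargetApplicationId", "ExecutionRoleArn"):
    print(key, "=", values[key])
```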
&lt;h3&gt;1.3 IDE and MCP server configuration&lt;/h3&gt; 
&lt;p&gt;Configure the &lt;code&gt;spark-upgrade&lt;/code&gt; MCP server. For &lt;a href="https://kiro.dev/docs/cli/" target="_blank" rel="noopener"&gt;Kiro CLI&lt;/a&gt;:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;kiro-cli-chat mcp add \
    --name "spark-upgrade" \
    --command "uvx" \
    --args '[
      "mcp-proxy-for-aws@latest",
      "https://sagemaker-unified-studio-mcp.'${SMUS_MCP_REGION}'.api.aws/spark-upgrade/mcp",
      "--service", "sagemaker-unified-studio-mcp",
      "--profile", "spark-upgrade-profile",
      "--region", "'${SMUS_MCP_REGION}'",
      "--read-timeout", "180"
    ]' \
    --timeout 180000 \
    --scope global&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;For other MCP clients, refer to your IDE’s MCP configuration documentation and use the same server parameters shown previously.&lt;/p&gt; 
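&lt;p&gt;Many MCP clients read a JSON file with a top-level &lt;code&gt;mcpServers&lt;/code&gt; object. Assuming your client follows that layout (check its documentation, because file names and optional fields vary), the following sketch generates an entry that mirrors the Kiro CLI arguments. The helper name &lt;code&gt;build_mcp_config&lt;/code&gt; and the Region value are illustrative.&lt;/p&gt;

```python
import json


def build_mcp_config(region, profile="spark-upgrade-profile"):
    """Build an mcpServers entry that mirrors the kiro-cli-chat arguments."""
    endpoint = (
        "https://sagemaker-unified-studio-mcp."
        + region
        + ".api.aws/spark-upgrade/mcp"
    )
    return {
        "mcpServers": {
            "spark-upgrade": {
                "command": "uvx",
                "args": [
                    "mcp-proxy-for-aws@latest",
                    endpoint,
                    "--service", "sagemaker-unified-studio-mcp",
                    "--profile", profile,
                    "--region", region,
                    "--read-timeout", "180",
                ],
            }
        }
    }


# "us-east-1" is a placeholder; use the value of SMUS_MCP_REGION.
mcp_config = build_mcp_config("us-east-1")
print(json.dumps(mcp_config, indent=2))
```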
&lt;p&gt;&lt;strong&gt;Verify the connection:&lt;/strong&gt; Start Kiro CLI and confirm the &lt;code&gt;spark-upgrade&lt;/code&gt; tools are loaded:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;$ kiro-cli chat
...
spark-upgrade (MCP):
- generate_spark_upgrade_plan          * not trusted
- update_build_configuration           * not trusted
- fix_upgrade_failure                  * not trusted
- run_validation_job                   * not trusted
- check_job_status                     * not trusted
...&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; After Kiro CLI and the MCP server are configured, you can ask the agent to verify your setup. For example: &lt;em&gt;“Check if I have AWS CLI, Python 3.10+, and uv installed, and confirm the spark-upgrade MCP server is connected.”&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/14/BDB-5889-3.jpg" alt="Kiro CLI output confirming spark-upgrade MCP server connection and tool availability" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;Output showing the status of each tool, AWS CLI, and MCP server.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Tip: Trust mode vs.&amp;nbsp;confirm mode:&lt;/strong&gt; When running the upgrade agent in Kiro CLI, you have two options:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Trust mode:&lt;/strong&gt; Type &lt;code&gt;t&lt;/code&gt; when prompted to approve a tool. The agent auto-approves subsequent uses of that tool without asking for confirmation. You can also use &lt;code&gt;/tools trust-all&lt;/code&gt; to trust every tool at once for a fully autonomous experience.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Confirm mode:&lt;/strong&gt; Type &lt;code&gt;y&lt;/code&gt; for each individual tool invocation. This lets you review, verify, and approve every action before the agent runs it. If this is your first time using the agent, use confirm mode for full visibility.&lt;/p&gt; 
&lt;h2&gt;2. Hands-on PySpark upgrade from Spark 3.5 to Spark 4.0&lt;/h2&gt; 
&lt;p&gt;This section walks through the complete migration of a representative PySpark application from Amazon EMR Serverless 7.0.0 (Spark 3.5.0) to EMR Serverless with the &lt;code&gt;emr-spark-8.0-preview&lt;/code&gt; release label (Spark 4.0.1), using the &lt;code&gt;global_logistics_platform&lt;/code&gt; sample.&lt;/p&gt; 
&lt;h3&gt;2.1 Sample project: global logistics platform&lt;/h3&gt; 
&lt;p&gt;The sample application is a multi-domain PySpark data processing application with three pipelines:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Fleet management:&lt;/strong&gt; Processes vehicle telemetry data (GPS tracking, fuel consumption, driver behavior scoring) using window functions, lag/lead operations, and statistical aggregations. Writes Parquet with &lt;code&gt;lz4raw&lt;/code&gt; compression.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;International shipping:&lt;/strong&gt; Handles cross-border shipment documents with multi-language address standardization using character encoding functions (&lt;code&gt;encode&lt;/code&gt;/&lt;code&gt;decode&lt;/code&gt; with charsets like &lt;code&gt;Shift_JIS&lt;/code&gt;, &lt;code&gt;GB2312&lt;/code&gt;, &lt;code&gt;EUC-KR&lt;/code&gt;), and processes carrier manifests with &lt;code&gt;ISO-8859-1&lt;/code&gt; encoding.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Historical compliance:&lt;/strong&gt; Processes regulatory audit records spanning centuries (including pre-1582 Julian calendar dates), requiring legacy datetime rebasing for Parquet writes.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Project structure:&lt;/strong&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;global_logistics_platform/
├── main.py                          # Orchestrator - runs all 3 pipelines
├── src/
│   ├── utils/
│   │   └── spark_config.py          # Spark session config &amp;amp; logging
│   └── domain/                      # Application code that needs migration
│       ├── fleet_management/
│       │   └── telemetry_processor.py
│       ├── international_shipping/
│       │   └── shipment_processor.py
│       └── historical_compliance/
│           └── compliance_processor.py
└── data/                             # Sample dataset for the workflow
    └── sample/
        ├── fleet_telemetry.csv
        ├── international_shipments.csv
        └── compliance_records.csv&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3&gt;2.2 The four Spark 4.0 incompatibilities&lt;/h3&gt; 
&lt;p&gt;Before diving into the upgrade, here are the four specific breaking changes present in this code base that the agent discovers and resolves entirely through runtime validation:&lt;/p&gt; 
&lt;table border="1px" width="100%" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;#&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Incompatibility&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;File(s)&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;1&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Legacy Parquet configuration key removed:&lt;/strong&gt; &lt;code&gt;spark.sql.legacy.parquet.datetimeRebaseModeInWrite&lt;/code&gt; removed in Spark 4.0. Must use &lt;code&gt;spark.sql.parquet.datetimeRebaseModeInWrite&lt;/code&gt;.&lt;/td&gt; 
   &lt;td&gt;&lt;code&gt;spark_config.py&lt;/code&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;2&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Parquet compression codec rename:&lt;/strong&gt; &lt;code&gt;lz4raw&lt;/code&gt; codec renamed to &lt;code&gt;lz4_raw&lt;/code&gt; in Spark 4.0.&lt;/td&gt; 
   &lt;td&gt;&lt;code&gt;telemetry_processor.py&lt;/code&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;3&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Stricter charset encoding validation:&lt;/strong&gt; Spark 4.0 tightened &lt;code&gt;encode()&lt;/code&gt; behavior. Encoding CJK (Chinese, Japanese, Korean) characters to &lt;code&gt;ISO-8859-1&lt;/code&gt; now throws &lt;code&gt;MALFORMED_CHARACTER_CODING&lt;/code&gt;. In Spark 3.x this silently replaced unmappable chars with &lt;code&gt;?&lt;/code&gt;. Restored via &lt;code&gt;spark.sql.legacy.codingErrorAction&lt;/code&gt;.&lt;/td&gt; 
   &lt;td&gt;&lt;code&gt;spark_config.py&lt;/code&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;4&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Character encoding restrictions:&lt;/strong&gt; &lt;code&gt;encode()&lt;/code&gt;/&lt;code&gt;decode()&lt;/code&gt; in Spark 4.0 supports US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, UTF-16, and UTF-32. Code uses &lt;code&gt;Shift_JIS&lt;/code&gt;, &lt;code&gt;GB2312&lt;/code&gt;, &lt;code&gt;EUC-KR&lt;/code&gt;.&lt;/td&gt; 
   &lt;td&gt;&lt;code&gt;shipment_processor.py&lt;/code&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;The agent resolves each of these through iterative runtime validation on EMR Serverless: submitting the job, diagnosing failures from Amazon CloudWatch logs, applying fixes, and resubmitting until the job succeeds.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/14/BDB-5889-4.png" alt="Architecture diagram showing the iterative validation workflow between the IDE, MCP server, and Amazon EMR Serverless" width="600"&gt;&lt;/p&gt; 
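&lt;p&gt;The agent finds these issues at runtime. As a separate and much simpler complement, you could run a static pre-check over your own source files before starting. The following sketch is not part of the agent: it uses plain string and regular-expression matching for three of the four patterns in the table. The third incompatibility depends on the data being encoded, so a text scan like this can't detect it. The function name &lt;code&gt;scan_source&lt;/code&gt; and the sample lines are illustrative.&lt;/p&gt;

```python
import re

SUPPORTED_CHARSETS = {
    "us-ascii", "iso-8859-1", "utf-8", "utf-16be", "utf-16le", "utf-16", "utf-32",
}
REMOVED_KEY = "spark.sql.legacy.parquet.datetimeRebaseModeInWrite"
# Matches a quoted final argument such as: , 'Shift_JIS')
CHARSET_ARG = re.compile(r",\s*['\"]([A-Za-z0-9_\-]+)['\"]\s*\)")


def scan_source(text):
    """Return (line_number, finding) tuples for known Spark 4.0 patterns."""
    findings = []
    for number, line in enumerate(text.splitlines(), start=1):
        if REMOVED_KEY in line:
            findings.append((number, "removed config key"))
        if re.search(r"['\"]lz4raw['\"]", line):
            findings.append((number, "codec renamed to lz4_raw"))
        # Only inspect charset arguments on lines that call encode or decode.
        if "encode(" in line or "decode(" in line:
            for charset in CHARSET_ARG.findall(line):
                if charset.lower() not in SUPPORTED_CHARSETS:
                    findings.append((number, "unsupported charset " + charset))
    return findings


sample = '''.config("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
.option("compression", "lz4raw")
expr("decode(encode(shipper_address, 'Shift_JIS'), 'UTF-8')")
'''

for finding in scan_source(sample):
    print(finding)
```

&lt;p&gt;A scan like this only narrows down where to look. Runtime validation on the target application remains the way to confirm that a job actually succeeds.&lt;/p&gt;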
&lt;h3&gt;2.3 Step 1: Invoke the upgrade agent&lt;/h3&gt; 
&lt;p&gt;Open the project in Kiro CLI and enter the following prompt:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;Upgrade my Spark application in the current directory from EMR serverless version 7.0.0 to EMR serverless version 8.0.0.
Use Amazon EMR Serverless target app-id &amp;lt;YOUR-TARGET-APP-ID&amp;gt; and execution role
&amp;lt;YOUR-EXECUTION-ROLE-ARN&amp;gt; for validation.
Use source Amazon EMR Serverless app-id &amp;lt;YOUR-SOURCE-APP-ID&amp;gt; for data quality baseline.
Store artifacts at s3://${STAGING_BUCKET_PATH}/spark4-upgrade/python/
Enable data quality validation&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; The &lt;code&gt;SourceApplicationId&lt;/code&gt;, &lt;code&gt;TargetApplicationId&lt;/code&gt;, and &lt;code&gt;ExecutionRoleArn&lt;/code&gt; are in the Outputs of the &lt;code&gt;spark-emr-serverless-upgrade&lt;/code&gt; AWS CloudFormation stack you deployed in Section 1.2.&lt;/p&gt; 
&lt;p&gt;The agent invokes &lt;code&gt;generate_spark_upgrade_plan&lt;/code&gt;, scans the project structure, identifies the Spark version mapping (EMR 7.0.0 → Spark 3.5.0, EMR 8.0.0 → Spark 4.0.1), and produces a structured upgrade plan with an Analysis ID for traceability.&lt;/p&gt; 
&lt;p&gt;The agent presents the plan and asks for confirmation. Type &lt;code&gt;y&lt;/code&gt; to approve the tool invocation, or &lt;code&gt;t&lt;/code&gt; to trust that tool for the rest of the session.&lt;/p&gt; 
&lt;p&gt;You can save the plan as a local JSON file, either for future reference or to resume the upgrade later. Ask Kiro to save it locally, and tell it which AWS CLI profile to use (the &lt;code&gt;spark-upgrade-profile&lt;/code&gt; that you configured in Section 1.2). The following prompt provides both inputs:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;Yes I would like to save the plan to a local file and use spark-upgrade-profile&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3&gt;2.4 Step 2: Build and package&lt;/h3&gt; 
&lt;p&gt;The agent validates that the Python project compiles successfully, then packages it for Amazon EMR Serverless deployment:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Runs &lt;code&gt;py_compile&lt;/code&gt; on each &lt;code&gt;.py&lt;/code&gt; file to verify syntax.&lt;/li&gt; 
 &lt;li&gt;Creates &lt;code&gt;src.zip&lt;/code&gt; containing the &lt;code&gt;src/&lt;/code&gt; directory (preserving the import structure used by &lt;code&gt;from src.utils import ...&lt;/code&gt;).&lt;/li&gt; 
 &lt;li&gt;Uploads &lt;code&gt;src.zip&lt;/code&gt;, &lt;code&gt;main.py&lt;/code&gt;, and sample input data to the Amazon S3 staging path.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# What the agent does behind the scenes:
zip -r src.zip src/
aws s3 cp main.py s3://&amp;lt;YOUR-BUCKET&amp;gt;/spark4-upgrade/python/&amp;lt;ANALYSIS-ID&amp;gt;/source/main.py
aws s3 cp src.zip s3://&amp;lt;YOUR-BUCKET&amp;gt;/spark4-upgrade/python/&amp;lt;ANALYSIS-ID&amp;gt;/source/src.zip
aws s3 cp data/sample/ s3://&amp;lt;YOUR-BUCKET&amp;gt;/spark4-upgrade/python/&amp;lt;ANALYSIS-ID&amp;gt;/input/ --recursive&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;This sample has no external dependencies (no &lt;code&gt;requirements.txt&lt;/code&gt;), so no virtual environment is needed. If your project has external dependencies in a &lt;code&gt;requirements.txt&lt;/code&gt;, the agent packages them into a virtual environment archive and includes it in the EMR Serverless submission parameters.&lt;/p&gt; 
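&lt;p&gt;To reproduce the compile check and packaging step locally, the following sketch uses only the Python standard library. It's an approximation of the steps listed in this section, not the agent's actual implementation, and the function name &lt;code&gt;compile_and_package&lt;/code&gt; is hypothetical. It assumes a &lt;code&gt;main.py&lt;/code&gt; and a &lt;code&gt;src/&lt;/code&gt; directory at the project root, as in the sample project.&lt;/p&gt;

```python
import pathlib
import py_compile
import zipfile


def compile_and_package(project_dir, archive_name="src.zip"):
    """Syntax-check every .py file, then zip src/ with paths relative to the root."""
    root = pathlib.Path(project_dir)
    sources = sorted((root / "src").rglob("*.py")) + [root / "main.py"]
    for path in sources:
        # doraise=True raises PyCompileError on a syntax error
        # instead of only printing a message.
        py_compile.compile(str(path), doraise=True)
    archive = root / archive_name
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as bundle:
        for path in sorted((root / "src").rglob("*")):
            # Skip the bytecode cache that the compile check creates.
            if path.is_file() and "__pycache__" not in path.parts:
                # Keep the leading "src/" so "from src.utils import ..." resolves.
                bundle.write(path, path.relative_to(root).as_posix())
    return archive
```

&lt;p&gt;Calling &lt;code&gt;compile_and_package(".")&lt;/code&gt; from the project root writes &lt;code&gt;src.zip&lt;/code&gt; next to &lt;code&gt;main.py&lt;/code&gt;, ready to upload to the staging path.&lt;/p&gt;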
&lt;h3&gt;2.5 Step 3: Data quality baseline on source application&lt;/h3&gt; 
&lt;p&gt;Before migrating the code, the agent establishes a data quality baseline by running the &lt;strong&gt;original (pre-upgrade) code&lt;/strong&gt; on the source Amazon EMR Serverless application (Spark 3.5.0 / EMR 7.0.0). This captures the expected output that the upgraded application must match.&lt;/p&gt; 
&lt;p&gt;The agent submits the job to the source application with the data quality check enabled, using a job configuration like the following:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-json"&gt;{
  "executionRoleArn": "arn:aws:iam::&amp;lt;YOUR-ACCOUNT-ID&amp;gt;:role/&amp;lt;YOUR-EXECUTION-ROLE&amp;gt;",
  "jobDriver": {
    "sparkSubmit": {
      "entryPoint": "s3://&amp;lt;YOUR-BUCKET&amp;gt;/spark4-upgrade/python/&amp;lt;ANALYSIS-ID&amp;gt;/source/main.py",
      "entryPointArguments": [
        "s3://&amp;lt;YOUR-BUCKET&amp;gt;/spark4-upgrade/python/&amp;lt;ANALYSIS-ID&amp;gt;/input/",
        "s3://&amp;lt;YOUR-BUCKET&amp;gt;/spark4-upgrade/python/&amp;lt;ANALYSIS-ID&amp;gt;/output/source/"
      ],
      "sparkSubmitParameters": "--py-files s3://&amp;lt;YOUR-BUCKET&amp;gt;/spark4-upgrade/python/&amp;lt;ANALYSIS-ID&amp;gt;/source/src.zip"
    }
  },
  "configurationOverrides": {
    "monitoringConfiguration": {
      "cloudWatchLoggingConfiguration": {
        "enabled": true,
        "logGroupName": "/aws/emr-serverless"
      }
    }
  }
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The agent monitors the source run via &lt;code&gt;check_job_status&lt;/code&gt; until it completes successfully. This baseline output is stored for comparison after the target validation succeeds.&lt;/p&gt; 
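&lt;p&gt;The internals of &lt;code&gt;check_job_status&lt;/code&gt; aren't shown in this post. As a generic illustration of the monitoring pattern, the following sketch polls a caller-supplied function until the job run reaches a terminal state. The &lt;code&gt;get_state&lt;/code&gt; callable is hypothetical; in practice it might wrap a call that returns the job run state from EMR Serverless. A simulated state sequence is used here so the sketch runs without AWS access.&lt;/p&gt;

```python
import time

# Terminal job run states; anything else means the run is still in progress.
TERMINAL_STATES = {"SUCCESS", "FAILED", "CANCELLED"}


def wait_for_job(get_state, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call get_state() until the job run reaches a terminal state, then return it."""
    for _ in range(max_polls):
        state = get_state()
        if state in TERMINAL_STATES:
            return state
        sleep(poll_seconds)
    raise TimeoutError("job run did not finish within the polling budget")


# Simulated state sequence; the no-op sleep keeps the example instant.
states = iter(["SUBMITTED", "PENDING", "RUNNING", "SUCCESS"])
final_state = wait_for_job(lambda: next(states), sleep=lambda seconds: None)
print(final_state)
```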
&lt;h3&gt;2.6 Step 4: Iterative runtime validation on target application&lt;/h3&gt; 
&lt;p&gt;This is the core of the upgrade. The agent submits the unmodified application to the target Amazon EMR Serverless application (Spark 4.0.1), and every incompatibility is discovered, diagnosed, and fixed through runtime failures. The agent drives the entire fix cycle by submitting to EMR, reading errors from Amazon CloudWatch logs, applying fixes, rebuilding, and resubmitting.&lt;/p&gt; 
&lt;p&gt;The agent presents the proposed Amazon EMR Serverless job configuration for your review before each submission. Type &lt;code&gt;y&lt;/code&gt; to approve.&lt;/p&gt; 
&lt;h4&gt;2.6.1 Fix 1: Legacy Parquet configuration key removed (iteration 1)&lt;/h4&gt; 
&lt;p&gt;The first submission &lt;strong&gt;fails immediately&lt;/strong&gt; at SparkSession initialization:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;org.apache.spark.sql.AnalysisException:
The SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInWrite' was removed
in the version 4.0.0. Use 'spark.sql.parquet.datetimeRebaseModeInWrite' instead.&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The Historical Compliance pipeline configures &lt;code&gt;spark.sql.legacy.parquet.datetimeRebaseModeInWrite&lt;/code&gt; for handling pre-1582 Julian calendar dates. Spark 4.0 removed the &lt;code&gt;legacy.&lt;/code&gt; prefix from this configuration key.&lt;/p&gt; 
&lt;p&gt;The agent calls &lt;code&gt;fix_upgrade_failure&lt;/code&gt;, which identifies the migration rule and recommends the fix:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;src/utils/spark_config.py&lt;/code&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;# Before
.config("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")

# After
.config("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY")&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;After applying the fix, the agent rebuilds &lt;code&gt;src.zip&lt;/code&gt;, re-uploads to Amazon S3, and resubmits the job.&lt;/p&gt; 
&lt;h4&gt;2.6.2 Fix 2: Parquet compression codec rename (iteration 2)&lt;/h4&gt; 
&lt;p&gt;The resubmitted job &lt;strong&gt;fails&lt;/strong&gt; with a new error, which confirms progress:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;pyspark.errors.exceptions.captured.IllegalArgumentException:
[CODEC_NOT_AVAILABLE.WITH_AVAILABLE_CODECS_SUGGESTION]
The codec lz4raw is not available.
Available codecs are brotli, uncompressed, lzo, snappy, lz4_raw, none, zstd, lz4, gzip.
SQLSTATE: 56038&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The Fleet Management pipeline’s &lt;code&gt;telemetry_processor.py&lt;/code&gt; uses &lt;code&gt;lz4raw&lt;/code&gt; as the Parquet compression codec. Spark 4.0 renamed this to &lt;code&gt;lz4_raw&lt;/code&gt; (with an underscore).&lt;/p&gt; 
&lt;p&gt;The recommended fix:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;src/domain/fleet_management/telemetry_processor.py&lt;/code&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;# Before
.option("compression", "lz4raw")

# After
.option("compression", "lz4_raw")&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The agent applies the change, rebuilds, and resubmits.&lt;/p&gt; 
&lt;h4&gt;2.6.3 Fix 3: Stricter charset encoding validation (iteration 3)&lt;/h4&gt; 
&lt;p&gt;The next submission surfaces a different failure:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;org.apache.spark.SparkRuntimeException:
[MALFORMED_CHARACTER_CODING]
Invalid value found when performing `encode` with ISO-8859-1
SQLSTATE: 22000&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The International Shipping pipeline’s &lt;code&gt;process_carrier_manifests()&lt;/code&gt; method uses &lt;code&gt;encode(..., 'ISO-8859-1')&lt;/code&gt; on data containing CJK (Chinese, Japanese, Korean) characters. Although &lt;code&gt;ISO-8859-1&lt;/code&gt; is in Spark 4.0’s supported charset list, it is a single-byte encoding that cannot represent CJK characters. In Spark 3.x, the Java charset encoder silently replaced unmappable characters with &lt;code&gt;?&lt;/code&gt;. Spark 4.0 tightened this behavior to throw &lt;code&gt;MALFORMED_CHARACTER_CODING&lt;/code&gt; for unmappable characters.&lt;/p&gt; 
&lt;p&gt;The agent identifies the migration rule and adds a legacy compatibility configuration:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;src/utils/spark_config.py&lt;/code&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;# Added to SparkSession builder
.config("spark.sql.legacy.codingErrorAction", "true")&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;This restores the Spark 3.x behavior where unmappable characters are silently replaced instead of throwing errors.&lt;/p&gt; 
&lt;p&gt;With the configuration added, the agent rebuilds and resubmits.&lt;/p&gt; 
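&lt;p&gt;Python's built-in codecs make the same distinction between strict and replacement error handling, which can help when reasoning about this change. The following is an analogy in plain Python, not Spark code, and the sample string is illustrative.&lt;/p&gt;

```python
text = "東京"  # CJK characters that ISO-8859-1 cannot represent

# Strict handling (comparable to the Spark 4.0 default): the encoder raises.
try:
    text.encode("iso-8859-1")
    strict_outcome = "encoded"
except UnicodeEncodeError:
    strict_outcome = "error"

# Replacement handling (comparable to the legacy behavior described above):
# each unmappable character becomes "?".
replaced = text.encode("iso-8859-1", errors="replace")

print(strict_outcome)
print(replaced)
```

&lt;p&gt;Replacement keeps the job running but discards the original characters. Before relying on the legacy setting long term, confirm that downstream consumers of the manifest data can tolerate that loss.&lt;/p&gt;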
&lt;h4&gt;2.6.4 Fix 4: Character encoding restrictions (iteration 4)&lt;/h4&gt; 
&lt;p&gt;The fourth submission fails with yet another encoding error:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;org.apache.spark.SparkIllegalArgumentException:
[INVALID_PARAMETER_VALUE.CHARSET]
The value of parameter(s) `charset` in `encode` is invalid:
expects one of the iso-8859-1, us-ascii, utf-16, utf-16be, utf-16le, utf-32, utf-8,
but got Shift_JIS. SQLSTATE: 22023&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The International Shipping pipeline’s &lt;code&gt;standardize_addresses_with_charset()&lt;/code&gt; method uses &lt;code&gt;Shift_JIS&lt;/code&gt;, &lt;code&gt;GB2312&lt;/code&gt;, and &lt;code&gt;EUC-KR&lt;/code&gt; charsets in &lt;code&gt;encode()&lt;/code&gt;/&lt;code&gt;decode()&lt;/code&gt; calls. Spark 4.0 restricts these functions to seven standard charsets. These regional charsets are not in the supported list.&lt;/p&gt; 
&lt;p&gt;The agent replaces the unsupported charsets with UTF-8:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;File:&lt;/strong&gt; &lt;code&gt;src/domain/international_shipping/shipment_processor.py&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Before (Spark 3.5.0):&lt;/strong&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;df = df.withColumn(
    "shipper_address_normalized",
    when(col("origin_country") == "JP",
         expr("decode(encode(shipper_address, 'Shift_JIS'), 'UTF-8')"))
    .when(col("origin_country") == "CN",
         expr("decode(encode(shipper_address, 'GB2312'), 'UTF-8')"))
    .when(col("origin_country") == "KR",
         expr("decode(encode(shipper_address, 'EUC-KR'), 'UTF-8')"))
    .otherwise(col("shipper_address"))
)&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;After (Spark 4.0.1):&lt;/strong&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;df = df.withColumn(
    "shipper_address_normalized",
    when(col("origin_country") == "JP",
         expr("decode(encode(shipper_address, 'UTF-8'), 'UTF-8')"))
    .when(col("origin_country") == "CN",
         expr("decode(encode(shipper_address, 'UTF-8'), 'UTF-8')"))
    .when(col("origin_country") == "KR",
         expr("decode(encode(shipper_address, 'UTF-8'), 'UTF-8')"))
    .otherwise(col("shipper_address"))
)&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The same transformation is applied to &lt;code&gt;consignee_address_normalized&lt;/code&gt;.&lt;/p&gt; 
&lt;p&gt;The agent rebuilds and resubmits one final time.&lt;/p&gt; 
&lt;h4&gt;2.6.5 Final submission: success&lt;/h4&gt; 
&lt;p&gt;The fifth submission completes successfully:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-json"&gt;{"success": true, "message": "EMR SERVERLESS job completed successfully",
"compute_run_id": "&amp;lt;JOB-RUN-ID&amp;gt;", "status": "SUCCESS",
"application_type": "EMR-Serverless"}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The three pipelines (Fleet Management, International Shipping, and Historical Compliance) complete on EMR Serverless with the &lt;code&gt;emr-spark-8.0-preview&lt;/code&gt; release label (Spark 4.0.1).&lt;/p&gt; 
&lt;h3&gt;2.7 Summary of the iterative runtime validation&lt;/h3&gt; 
&lt;p&gt;The runtime validation loop is the core value of the upgrade agent. Here’s the complete iteration history:&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/14/BDB-5889-5.png" alt="Table showing the four validation iterations with error types and fixes applied" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;Each iteration follows the same cycle:&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/14/BDB-5889-6.png" alt="Diagram showing the submit, diagnose, fix, rebuild, and resubmit cycle" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;Failures that would normally require manual log analysis, root cause investigation, and code patching are resolved automatically by the agent in this workflow.&lt;/p&gt; 
&lt;h2&gt;3. Data quality validation&lt;/h2&gt; 
&lt;p&gt;With both the source baseline (Section 2.5) and the upgraded target run (Section 2.6) completed successfully, the agent performs data quality validation to verify the migration hasn’t changed your application’s output. This is the key advantage of including the source application in your upgrade prompt: the agent can compare outputs from both Spark versions side by side.&lt;/p&gt; 
&lt;h3&gt;3.1 Data quality comparison&lt;/h3&gt; 
&lt;p&gt;The agent invokes &lt;code&gt;get_data_quality_summary&lt;/code&gt; to compare the outputs across four dimensions:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Schema validation:&lt;/strong&gt; Confirms column names, data types, and column ordering match between source and target outputs.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Row count validation:&lt;/strong&gt; Verifies no data loss or duplication during migration.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Nullability validation:&lt;/strong&gt; Detects changes in null handling.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Statistical summary validation:&lt;/strong&gt; Compares numeric and string column distributions (&lt;code&gt;min&lt;/code&gt;, &lt;code&gt;max&lt;/code&gt;, &lt;code&gt;mean&lt;/code&gt;, &lt;code&gt;count&lt;/code&gt;, distinct values).&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The agent presents the comparison results:&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/14/BDB-5889-7.png" alt="Data quality summary showing schema, row count, and nullability checks passing with a statistical mismatch in shipper_address" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;The preceding image shows the data quality summary.&lt;/p&gt; 
&lt;p&gt;Three of four checks pass cleanly. The statistical summary validation detects a mismatch in the &lt;code&gt;shipper_address&lt;/code&gt; column of the &lt;code&gt;customs_declarations&lt;/code&gt; output: the &lt;code&gt;max&lt;/code&gt; and &lt;code&gt;min&lt;/code&gt; summary values differ between source and target.&lt;/p&gt; 
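&lt;p&gt;To make the four dimensions concrete, the following sketch compares two small in-memory datasets. It is a simplified illustration, not the logic behind &lt;code&gt;get_data_quality_summary&lt;/code&gt;: the schema check covers only column names and order (not data types), the statistics are limited to null count, minimum, maximum, and distinct count, and the sample rows are hypothetical.&lt;/p&gt;

```python
def summarize(rows):
    """Compute per-column null, min, max, and distinct counts for dict rows."""
    columns = list(rows[0].keys()) if rows else []
    summary = {"columns": columns, "row_count": len(rows), "stats": {}}
    for name in columns:
        values = [row[name] for row in rows if row[name] is not None]
        summary["stats"][name] = {
            "nulls": len(rows) - len(values),
            "min": min(values) if values else None,
            "max": max(values) if values else None,
            "distinct": len(set(values)),
        }
    return summary


def compare(source_rows, target_rows):
    """Return pass/fail per dimension: schema, row count, nullability, statistics."""
    src, tgt = summarize(source_rows), summarize(target_rows)
    shared = [c for c in src["columns"] if c in tgt["stats"]]
    return {
        "schema": src["columns"] == tgt["columns"],
        "row_count": src["row_count"] == tgt["row_count"],
        "nullability": all(
            src["stats"][c]["nulls"] == tgt["stats"][c]["nulls"] for c in shared
        ),
        "statistics": all(src["stats"][c] == tgt["stats"][c] for c in shared),
    }


# Hypothetical rows: the source holds a garbled address, the target a clean one.
source = [{"id": 1, "shipper_address": "??"}, {"id": 2, "shipper_address": None}]
target = [{"id": 1, "shipper_address": "東京"}, {"id": 2, "shipper_address": None}]
result = compare(source, target)
print(result)
```

&lt;p&gt;With these sample rows, the schema, row count, and nullability checks pass while the statistics check fails, the same pattern as in the summary above.&lt;/p&gt;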
&lt;h3&gt;3.2 Understanding and resolving the mismatch&lt;/h3&gt; 
&lt;p&gt;This mismatch is a direct consequence of Fix 4 (Section 2.6.4). The original code encoded each address with a regional charset (&lt;code&gt;Shift_JIS&lt;/code&gt;, &lt;code&gt;GB2312&lt;/code&gt;, or &lt;code&gt;EUC-KR&lt;/code&gt;) and then decoded the resulting bytes as &lt;code&gt;UTF-8&lt;/code&gt;. Because bytes produced by a regional charset are not the &lt;code&gt;UTF-8&lt;/code&gt; encoding of the same characters, that roundtrip produced garbled text. The upgraded code uses &lt;code&gt;UTF-8&lt;/code&gt; → &lt;code&gt;UTF-8&lt;/code&gt;, preserving addresses faithfully. The mismatch reflects &lt;strong&gt;improved&lt;/strong&gt; data quality, not a regression.&lt;/p&gt; 
&lt;p&gt;Schema, row counts, and nullability matched exactly: the difference is limited to string values that were previously garbled. No further action is needed. The upgraded application is production-ready.&lt;/p&gt; 
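&lt;p&gt;You can see why the two roundtrips differ with Python's built-in codecs. The following is an analogy in plain Python rather than Spark code, and the exact garbled output that Spark 3.5 produced might differ from what Python shows. The sample address is illustrative.&lt;/p&gt;

```python
address = "東京都千代田区"

# Original pattern: encode with a regional charset, then decode the bytes as UTF-8.
regional_bytes = address.encode("shift_jis")
garbled = regional_bytes.decode("utf-8", errors="replace")

# Upgraded pattern: UTF-8 in, UTF-8 out.
preserved = address.encode("utf-8").decode("utf-8")

print(garbled == address)
print(preserved == address)
```

&lt;p&gt;The first comparison is false and the second is true: only the &lt;code&gt;UTF-8&lt;/code&gt; roundtrip returns the original string.&lt;/p&gt;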
&lt;p&gt;&lt;strong&gt;Expected behavior:&lt;/strong&gt; Character encoding migrations can change stored string values even when the intended meaning of the data is preserved. When data quality validation reports mismatches, trace each one back to a specific code change. If the mismatch is explained by a required migration fix (as here), verify that the new behavior is correct and document it. If a mismatch cannot be explained, investigate before promoting to production.&lt;/p&gt; 
&lt;h2&gt;4. Upgrade summary&lt;/h2&gt; 
&lt;p&gt;After the agent completes the entire upgrade workflow, it produces a comprehensive upgrade summary following a structured template. This summary lets you review the job configuration updates, code modifications with diffs and file references, relevant migration rules applied, and data quality validation status.&lt;/p&gt; 
&lt;p&gt;Here is the summary the agent produced for this upgrade:&lt;/p&gt; 
&lt;h3&gt;Upgrade plan&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;Compile and build project with current Spark 3.5.0: validated that Python files compile successfully.&lt;/li&gt; 
 &lt;li&gt;Run baseline validation on source EMR Serverless (00g4vhvt1lhtrs09) with Spark 3.5.0: established data quality baseline.&lt;/li&gt; 
 &lt;li&gt;Run target validation on target EMR Serverless (00g4vhvt3np1bj09) with Spark 4.0.1: fixed 4 issues iteratively across 4 validation attempts.&lt;/li&gt; 
 &lt;li&gt;Compare data quality between source and target runs: detected expected mismatch in &lt;code&gt;shipper_address&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;Generate and persist upgrade summary.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Upgrade result&lt;/h3&gt; 
&lt;p&gt;Upgrade completed with data validation enabled. Data validation detected an expected mismatch in the &lt;code&gt;shipper_address&lt;/code&gt; column because of the charset encoding migration from unsupported charsets (&lt;code&gt;Shift_JIS&lt;/code&gt;, &lt;code&gt;GB2312&lt;/code&gt;, &lt;code&gt;EUC-KR&lt;/code&gt;) to &lt;code&gt;UTF-8&lt;/code&gt;.&lt;/p&gt; 
&lt;h3&gt;Dependency changes&lt;/h3&gt; 
&lt;p&gt;No external dependencies were changed in this project (no &lt;code&gt;requirements.txt&lt;/code&gt;).&lt;/p&gt; 
&lt;h3&gt;Job configuration changes&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Parquet datetime rebase configuration key renamed.&lt;/strong&gt; 
  &lt;ul&gt; 
   &lt;li&gt;Change: &lt;code&gt;spark.sql.legacy.parquet.datetimeRebaseModeInWrite&lt;/code&gt; → &lt;code&gt;spark.sql.parquet.datetimeRebaseModeInWrite&lt;/code&gt;.&lt;/li&gt; 
   &lt;li&gt;Migration rule: In Spark 4.0, the legacy datetime rebasing SQL configurations with the prefix &lt;code&gt;spark.sql.legacy&lt;/code&gt; are removed. The SQL configuration &lt;code&gt;spark.sql.legacy.parquet.datetimeRebaseModeInWrite&lt;/code&gt; was removed in the version 4.0.0. Use &lt;code&gt;spark.sql.parquet.datetimeRebaseModeInWrite&lt;/code&gt; instead.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Legacy coding error action enabled.&lt;/strong&gt; 
  &lt;ul&gt; 
   &lt;li&gt;Change: Added &lt;code&gt;spark.sql.legacy.codingErrorAction&lt;/code&gt; set to &lt;code&gt;true&lt;/code&gt;.&lt;/li&gt; 
   &lt;li&gt;Migration rule: In Spark 4.0, the &lt;code&gt;encode()&lt;/code&gt; and &lt;code&gt;decode()&lt;/code&gt; functions raise &lt;code&gt;MALFORMED_CHARACTER_CODING&lt;/code&gt; error when handling unmappable characters. In Spark 3.5 and earlier versions, these characters are replaced with garbled text. To restore the previous behavior, set &lt;code&gt;spark.sql.legacy.codingErrorAction&lt;/code&gt; to &lt;code&gt;true&lt;/code&gt;.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ul&gt; 
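&lt;p&gt;The two configuration changes above can be sketched as a small dictionary rewrite. This is an illustration only, not part of the agent's output: &lt;code&gt;migrate_spark_conf&lt;/code&gt; is a hypothetical helper, and the &lt;code&gt;spark.app.name&lt;/code&gt; entry is a placeholder. Only the configuration keys themselves come from the summary.&lt;/p&gt; 

```python
# Hypothetical helper: rewrite a Spark 3.5 configuration dict for Spark 4.0.
# It handles only the two changes listed in this upgrade summary.

RENAMED_KEYS = {
    "spark.sql.legacy.parquet.datetimeRebaseModeInWrite":
        "spark.sql.parquet.datetimeRebaseModeInWrite",
}

ADDED_SETTINGS = {
    "spark.sql.legacy.codingErrorAction": "true",
}


def migrate_spark_conf(conf):
    """Return a new dict with renamed keys applied and legacy settings added."""
    migrated = {}
    for key, value in conf.items():
        # Keep the value, swap the key if it was renamed.
        migrated[RENAMED_KEYS.get(key, key)] = value
    for key, value in ADDED_SETTINGS.items():
        # Do not override a value the job already sets explicitly.
        migrated.setdefault(key, value)
    return migrated


old_conf = {
    "spark.sql.legacy.parquet.datetimeRebaseModeInWrite": "LEGACY",
    "spark.app.name": "placeholder-app-name",  # illustrative entry
}
new_conf = migrate_spark_conf(old_conf)
for key in sorted(new_conf):
    print(key, "=", new_conf[key])
```

&lt;p&gt;The helper returns a new dictionary rather than mutating its input, so the Spark 3.5 configuration stays available for the baseline run.&lt;/p&gt; 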
&lt;h3&gt;Code changes&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Validation attempt 1:&lt;/strong&gt; Legacy Parquet configuration key. 
  &lt;ul&gt; 
   &lt;li&gt;Validation run: EMR-Serverless &lt;code&gt;job_run_id&lt;/code&gt; &lt;code&gt;00g4vm14v118vg0b&lt;/code&gt;.&lt;/li&gt; 
   &lt;li&gt;Error: The SQL config &lt;code&gt;'spark.sql.legacy.parquet.datetimeRebaseModeInWrite'&lt;/code&gt; was removed in the version 4.0.0.&lt;/li&gt; 
   &lt;li&gt;Applied changes: &lt;code&gt;src/utils/spark_config.py&lt;/code&gt;: Changed &lt;code&gt;.config("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")&lt;/code&gt; to &lt;code&gt;.config("spark.sql.parquet.datetimeRebaseModeInWrite", "LEGACY")&lt;/code&gt;.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Validation attempt 2:&lt;/strong&gt; Parquet compression codec. 
  &lt;ul&gt; 
   &lt;li&gt;Validation run: EMR-Serverless &lt;code&gt;job_run_id&lt;/code&gt; &lt;code&gt;00g4vm5pm1hig00b&lt;/code&gt;.&lt;/li&gt; 
   &lt;li&gt;Error: &lt;code&gt;[CODEC_NOT_AVAILABLE.WITH_AVAILABLE_CODECS_SUGGESTION]&lt;/code&gt; The codec &lt;code&gt;lz4raw&lt;/code&gt; is not available.&lt;/li&gt; 
   &lt;li&gt;Applied changes: &lt;code&gt;src/domain/fleet_management/telemetry_processor.py&lt;/code&gt;: Changed &lt;code&gt;.option("compression", "lz4raw")&lt;/code&gt; to &lt;code&gt;.option("compression", "lz4_raw")&lt;/code&gt;.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Validation attempt 3:&lt;/strong&gt; Stricter charset encoding. 
  &lt;ul&gt; 
   &lt;li&gt;Validation run: EMR-Serverless &lt;code&gt;job_run_id&lt;/code&gt; &lt;code&gt;00g4vm8sh4sp0g0b&lt;/code&gt;.&lt;/li&gt; 
   &lt;li&gt;Error: &lt;code&gt;[MALFORMED_CHARACTER_CODING]&lt;/code&gt; Invalid value found when performing &lt;code&gt;encode&lt;/code&gt; with &lt;code&gt;ISO-8859-1&lt;/code&gt;.&lt;/li&gt; 
   &lt;li&gt;Applied changes: &lt;code&gt;src/utils/spark_config.py&lt;/code&gt;: Added &lt;code&gt;.config("spark.sql.legacy.codingErrorAction", "true")&lt;/code&gt; to the SparkSession builder.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Validation attempt 4:&lt;/strong&gt; Unsupported charsets. 
  &lt;ul&gt; 
   &lt;li&gt;Validation run: EMR-Serverless &lt;code&gt;job_run_id&lt;/code&gt; &lt;code&gt;00g4vmc668ng6o0b&lt;/code&gt;.&lt;/li&gt; 
   &lt;li&gt;Error: &lt;code&gt;[INVALID_PARAMETER_VALUE.CHARSET]&lt;/code&gt; charset in &lt;code&gt;encode&lt;/code&gt; is invalid: expects one of &lt;code&gt;iso-8859-1&lt;/code&gt;, &lt;code&gt;us-ascii&lt;/code&gt;, &lt;code&gt;utf-16&lt;/code&gt;, &lt;code&gt;utf-16be&lt;/code&gt;, &lt;code&gt;utf-16le&lt;/code&gt;, &lt;code&gt;utf-32&lt;/code&gt;, &lt;code&gt;utf-8&lt;/code&gt;, but got &lt;code&gt;Shift_JIS&lt;/code&gt;.&lt;/li&gt; 
   &lt;li&gt;Applied changes: &lt;code&gt;src/domain/international_shipping/shipment_processor.py&lt;/code&gt;: Replaced &lt;code&gt;Shift_JIS&lt;/code&gt;, &lt;code&gt;GB2312&lt;/code&gt;, &lt;code&gt;EUC-KR&lt;/code&gt; with &lt;code&gt;UTF-8&lt;/code&gt; for shipper and consignee address encoding.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ul&gt; 
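&lt;p&gt;A check like the one behind validation attempt 4 can be run before submitting a job. The following is a minimal sketch, assuming the seven charsets named in the error message are the complete supported list and that names are matched case-insensitively. The function names are hypothetical and are not part of Spark or the agent.&lt;/p&gt; 

```python
# Charsets accepted by encode()/decode() in Spark 4.0, copied from the
# INVALID_PARAMETER_VALUE.CHARSET error message quoted above.
SPARK4_CHARSETS = {
    "iso-8859-1", "us-ascii", "utf-16", "utf-16be",
    "utf-16le", "utf-32", "utf-8",
}


def is_supported_charset(name):
    """Case-insensitive membership test (an assumption of this sketch)."""
    return name.strip().lower() in SPARK4_CHARSETS


def unsupported_charsets(names):
    """Return the charsets that need replacing, in input order."""
    return [name for name in names if not is_supported_charset(name)]


# Charsets that appear in this upgrade.
used = ["UTF-8", "Shift_JIS", "GB2312", "EUC-KR", "ISO-8859-1"]
print(unsupported_charsets(used))
```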
&lt;h3&gt;Data validation result&lt;/h3&gt; 
&lt;table border="1px" width="100%" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;#&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Validation&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Status&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;1&lt;/td&gt; 
   &lt;td&gt;Schema validation (column names, types, ordering)&lt;/td&gt; 
   &lt;td&gt;Passed (no difference)&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;2&lt;/td&gt; 
   &lt;td&gt;Row count validation (no data loss)&lt;/td&gt; 
   &lt;td&gt;Passed (no difference)&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;3&lt;/td&gt; 
   &lt;td&gt;Nullability validation (null handling changes)&lt;/td&gt; 
   &lt;td&gt;Passed (no difference)&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;4&lt;/td&gt; 
   &lt;td&gt;Statistical summary validation (numeric/string distributions)&lt;/td&gt; 
   &lt;td&gt;Failed (with difference)&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;Data mismatches:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;The &lt;code&gt;shipper_address&lt;/code&gt; column &lt;code&gt;max&lt;/code&gt; summary value changed in &lt;code&gt;customs_declarations&lt;/code&gt; output. This is expected because of the charset encoding migration from &lt;code&gt;Shift_JIS&lt;/code&gt;/&lt;code&gt;GB2312&lt;/code&gt;/&lt;code&gt;EUC-KR&lt;/code&gt; to &lt;code&gt;UTF-8&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;The &lt;code&gt;shipper_address&lt;/code&gt; column &lt;code&gt;min&lt;/code&gt; summary value changed in &lt;code&gt;customs_declarations&lt;/code&gt; output for the same expected cause.&lt;/li&gt; 
&lt;/ol&gt; 
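&lt;p&gt;To make the four validation categories concrete, here is a minimal sketch of how two run summaries could be compared. It is not the agent's implementation. The summary layout, the &lt;code&gt;compare_summaries&lt;/code&gt; function, the &lt;code&gt;declaration_id&lt;/code&gt; column, and every value in the sample data are hypothetical placeholders, not results from this upgrade.&lt;/p&gt; 

```python
def compare_summaries(source, target):
    """Return a list of (check_name, passed, details) tuples."""
    checks = [
        ("schema", source["schema"] == target["schema"], []),
        ("row_count", source["row_count"] == target["row_count"], []),
        ("nullability", source["null_counts"] == target["null_counts"], []),
    ]
    changed = []
    for column, stats in source["stats"].items():
        target_stats = target["stats"].get(column, {})
        for stat_name, value in stats.items():
            if target_stats.get(stat_name) != value:
                changed.append(column + "." + stat_name)
    checks.append(("statistics", not changed, sorted(changed)))
    return checks


# Placeholder summaries: same schema, counts, and nulls, different min/max.
source_summary = {
    "schema": [("declaration_id", "string"), ("shipper_address", "string")],
    "row_count": 3,
    "null_counts": {"declaration_id": 0, "shipper_address": 0},
    "stats": {"shipper_address": {"min": "source-min", "max": "source-max"}},
}
target_summary = dict(
    source_summary,
    stats={"shipper_address": {"min": "target-min", "max": "target-max"}},
)

report = compare_summaries(source_summary, target_summary)
for name, passed, details in report:
    print(name, "Passed" if passed else "Failed", details)
```

&lt;p&gt;Reporting the exact column and statistic that changed makes it easier to trace each mismatch back to a specific code change.&lt;/p&gt; 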
&lt;h2&gt;5. Conclusion&lt;/h2&gt; 
&lt;p&gt;The AWS Spark Upgrade Agent turns a traditionally time-consuming PySpark migration into an automated, iterative workflow. For the Global Logistics Platform sample, the agent identified and resolved four distinct Spark 4.0 breaking changes: legacy Parquet configuration key removal, compression codec renames, stricter charset encoding validation, and character encoding restrictions. Each fix was applied across three domain processors, through natural language interaction in the IDE.&lt;/p&gt; 
&lt;p&gt;Every incompatibility was discovered through runtime validation on Amazon EMR Serverless. The agent submitted the unmodified code to the target EMR Serverless application, and each failure revealed the next breaking change:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;The &lt;code&gt;spark.sql.legacy.parquet.datetimeRebaseModeInWrite&lt;/code&gt; configuration removal, which crashes SparkSession initialization.&lt;/li&gt; 
 &lt;li&gt;The &lt;code&gt;lz4raw&lt;/code&gt; → &lt;code&gt;lz4_raw&lt;/code&gt; codec rename, which fails when Parquet writes run.&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;ISO-8859-1&lt;/code&gt; encoding of CJK characters: &lt;code&gt;ISO-8859-1&lt;/code&gt; is a valid Spark 4.0 charset, so the failure surfaces only when the code runs against real multi-language data, because Spark 4.0 tightened charset encoding validation to reject unmappable characters.&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;Shift_JIS&lt;/code&gt;/&lt;code&gt;GB2312&lt;/code&gt;/&lt;code&gt;EUC-KR&lt;/code&gt; charsets removed from Spark 4.0’s supported charset list entirely.&lt;/li&gt; 
&lt;/ul&gt; 
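&lt;p&gt;The difference between lenient and strict handling of unmappable characters can be reproduced in plain Python, without Spark. The following sketch is an analogy that uses Python's own codec error handlers. It does not call Spark's &lt;code&gt;encode()&lt;/code&gt; function, and the sample address is made up.&lt;/p&gt; 

```python
# Sample text: three CJK characters followed by ASCII (made-up value).
address = "\u6771\u4eac\u90fd Tokyo"

# Lenient handling: each unmappable character becomes "?", so data is
# silently lost. This is comparable to the pre-4.0 behavior in the table.
lenient = address.encode("iso-8859-1", errors="replace")
print(lenient)

# Strict handling: the same input raises an error instead. This is
# comparable to MALFORMED_CHARACTER_CODING in Spark 4.0.
try:
    address.encode("iso-8859-1", errors="strict")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True
print(strict_failed)

# UTF-8 can represent every character, so the round trip is lossless.
round_trip = address.encode("utf-8").decode("utf-8")
print(round_trip == address)
```

&lt;p&gt;This is why the failure depends on the data: an ASCII-only address encodes cleanly under either handler, and only rows with CJK characters trigger the strict error.&lt;/p&gt; 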
&lt;p&gt;The agent diagnosed each error from Amazon CloudWatch logs, applied the fix, rebuilt, and resubmitted without manual intervention beyond approving each step. The data quality validation then confirmed that the upgraded application produces equivalent output on Spark 4.0.1: schema, row counts, and nullability matched exactly. The one difference, in the &lt;code&gt;shipper_address&lt;/code&gt; column, resulted from the charset migration from regional encodings to UTF-8, which actually &lt;em&gt;improved&lt;/em&gt; data quality by eliminating garbled text from incorrect encoding roundtrips. With each mismatch traced back to a specific, understood code change, the upgraded application is production-ready.&lt;/p&gt; 
&lt;table border="1px" width="100%" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;#&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Category&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Spark 3.x behavior&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Spark 4.0 change&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Agent fix&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;1&lt;/td&gt; 
   &lt;td&gt;Parquet datetime configuration&lt;/td&gt; 
   &lt;td&gt;&lt;code&gt;spark.sql.legacy.parquet.datetimeRebaseModeInWrite&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;code&gt;legacy.&lt;/code&gt; prefix removed from key name&lt;/td&gt; 
   &lt;td&gt;Update configuration key&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;2&lt;/td&gt; 
   &lt;td&gt;Parquet compression&lt;/td&gt; 
   &lt;td&gt;&lt;code&gt;lz4raw&lt;/code&gt; codec name&lt;/td&gt; 
   &lt;td&gt;Renamed to &lt;code&gt;lz4_raw&lt;/code&gt; (with underscore)&lt;/td&gt; 
   &lt;td&gt;Update codec name&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;3&lt;/td&gt; 
   &lt;td&gt;Charset + CJK data&lt;/td&gt; 
   &lt;td&gt;&lt;code&gt;ISO-8859-1&lt;/code&gt; silently replaced unmappable CJK chars with &lt;code&gt;?&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;Stricter charset validation throws &lt;code&gt;MALFORMED_CHARACTER_CODING&lt;/code&gt; for unmappable characters&lt;/td&gt; 
   &lt;td&gt;Add &lt;code&gt;spark.sql.legacy.codingErrorAction=true&lt;/code&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;4&lt;/td&gt; 
   &lt;td&gt;Character encoding&lt;/td&gt; 
   &lt;td&gt;&lt;code&gt;encode()&lt;/code&gt;/&lt;code&gt;decode()&lt;/code&gt; supported Java charsets&lt;/td&gt; 
   &lt;td&gt;Restricted to 7 standard charsets&lt;/td&gt; 
   &lt;td&gt;Replace unsupported charsets with &lt;code&gt;UTF-8&lt;/code&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;&lt;strong&gt;Next steps after your first upgrade:&lt;/strong&gt;&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;Apply the agent to your production PySpark code base.&lt;/li&gt; 
 &lt;li&gt;Integrate the upgrade workflow into your CI/CD pipeline.&lt;/li&gt; 
 &lt;li&gt;Explore Scala application upgrades (see Part 3 of this series).&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;To get started with your own PySpark migration:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Deploy the AWS CloudFormation templates from Section 1.2 for one-time AWS IAM, Amazon S3, and Amazon EMR Serverless setup.&lt;/li&gt; 
 &lt;li&gt;Configure the &lt;code&gt;spark-upgrade&lt;/code&gt; MCP server in your MCP-compatible IDE.&lt;/li&gt; 
 &lt;li&gt;Point the agent at your PySpark project and let it handle the rest.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;For more information, see the &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/" target="_blank" rel="noopener"&gt;Amazon EMR Serverless documentation&lt;/a&gt;, the &lt;a href="https://spark.apache.org/docs/latest/migration-guide.html" target="_blank" rel="noopener"&gt;Apache Spark 4.0 migration guide&lt;/a&gt;, and the &lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-upgrade-agent-setup.html" target="_blank" rel="noopener"&gt;AWS Spark Upgrade Agent setup guide&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;6. Clean up resources&lt;/h2&gt; 
&lt;p&gt;To avoid ongoing costs, delete the resources you created:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;Delete the Amazon EMR Serverless stack: 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="language-bash"&gt;aws cloudformation delete-stack --stack-name spark-emr-serverless-upgrade --region ${SMUS_MCP_REGION}&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;/li&gt; 
 &lt;li&gt;If the Amazon S3 staging bucket contains objects, empty it before you delete the staging stack in the next step: 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="language-bash"&gt;aws s3 rm s3://${STAGING_BUCKET_PATH} --recursive&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;/li&gt; 
 &lt;li&gt;Delete the AWS IAM and Amazon S3 staging stack: 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="language-bash"&gt;aws cloudformation delete-stack --stack-name spark-upgrade-mcp-setup --region ${SMUS_MCP_REGION}&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/26/BDB-5889a1.jpg" alt="Prasad Nadig" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Prasad Nadig&lt;/h3&gt; 
  &lt;p&gt;Prasad Nadig is a Senior Analytics Specialist Solutions Architect at AWS, specializing in data and AI, including data lakes, data warehousing, and analytics services such as Amazon Redshift, Amazon EMR, and AWS Glue. He helps customers architect, migrate, and modernize their data and analytics workloads to achieve scalable, performant, and cost-effective solutions on AWS.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/01/18/image-4-3-100x150.jpeg" alt="Karthik Prabhakar" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Karthik Prabhakar&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/karthik-prabhakar/" target="_blank" rel="noopener"&gt;Karthik&lt;/a&gt; is a Data Processing Engines Architect for Amazon EMR at Amazon Web Services (AWS). He specializes in distributed systems architecture and query optimization, working with customers to solve complex performance challenges in large-scale data processing workloads. His focus spans engine internals, cost-optimization strategies, and architectural patterns that enable customers to run petabyte-scale analytics efficiently.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/f1f836cb4ea6efb2a0b1b99f41ad8b103eff4b59/2025/09/18/bezuwa.png" alt="Bezuayehu Wate" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Bezuayehu Wate&lt;/h3&gt; 
  &lt;p&gt;Bezuayehu is a Specialist Solutions Architect at AWS, specializing in big data analytics and AI solutions. She works closely with customers to modernize analytics platforms using AWS data and AI services. With a passion for emerging technologies and customer success, she thrives on designing innovative cloud solutions that deliver measurable business impact and drive organizational transformation.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/20/chuhanl.jpg" alt="Chuhan Liu" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Chuhan Liu&lt;/h3&gt; 
  &lt;p&gt;Chuhan is a Software Development Engineer at AWS.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/20/keechak.jpg" alt="Keerthi Chadalavada" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Keerthi Chadalavada&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/keerthi-chadalavada-a7344133/" target="_blank" rel="noopener"&gt;Keerthi&lt;/a&gt; is a Senior Software Development Engineer in the AWS analytics organization. She focuses on combining generative AI and data integration technologies to design and build comprehensive solutions for customer data and analytics needs.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2021/12/21/Pradeep-Patel.jpg" alt="Pradeep Patel" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Pradeep Patel&lt;/h3&gt; 
  &lt;p&gt;Pradeep is a Sr. Software Engineer at AWS Glue. He is passionate about helping customers solve their problems by using the power of the AWS Cloud to deliver highly scalable and robust solutions. In his spare time, he loves to hike and play with web applications.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>Announcing Spark Connect on Amazon EMR Serverless: Interactive PySpark development, anywhere</title>
		<link>https://aws.amazon.com/blogs/big-data/announcing-spark-connect-on-amazon-emr-serverless-interactive-pyspark-development-anywhere/</link>
		
		<dc:creator><![CDATA[Al MS]]></dc:creator>
		<pubDate>Tue, 09 Jun 2026 16:15:54 +0000</pubDate>
				<category><![CDATA[Amazon EMR]]></category>
		<category><![CDATA[Announcements]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<category><![CDATA[Serverless]]></category>
		<guid isPermaLink="false">bde964274eb4dc29a32228a82f0a8f6d00c63305</guid>

					<description>Today, AWS is announcing support for Spark Connect on Amazon EMR Serverless with EMR release 7.13 (Apache Spark 3.5.6) and later versions. You can now build and debug Spark applications from your preferred local environment while running full-scale Spark operations on EMR Serverless.</description>
										<content:encoded>&lt;p&gt;Today, AWS is announcing support for &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/spark-connect.html" target="_blank" rel="noopener"&gt;Spark Connect on Amazon EMR Serverless&lt;/a&gt; with EMR release 7.13 (&lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html" target="_blank" rel="noopener"&gt;Apache Spark&lt;/a&gt; 3.5.6) and later versions. You can now build and debug Spark applications from your preferred local environment while running full-scale Spark operations on EMR Serverless.&lt;/p&gt; 
&lt;p&gt;Previously, code that worked on a local machine might break in production because of environment mismatches, dependency conflicts, or unexpected data patterns. The only way to catch it was a deploy-and-check cycle. With the Spark Connect feature, you can develop Spark code from a supported local environment, such as an IDE (for example, &lt;a href="https://code.visualstudio.com/" target="_blank" rel="noopener"&gt;VS Code&lt;/a&gt; or &lt;a href="https://www.jetbrains.com/pycharm/" target="_blank" rel="noopener"&gt;PyCharm&lt;/a&gt;), Jupyter notebooks, &lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/what-is-sagemaker-unified-studio.html" target="_blank" rel="noopener"&gt;Amazon SageMaker Unified Studio&lt;/a&gt; (SMUS) &lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/notebooks.html" target="_blank" rel="noopener"&gt;Data Notebooks&lt;/a&gt;, &lt;a href="https://aws.amazon.com/q/developer/" target="_blank" rel="noopener"&gt;Amazon Q Developer&lt;/a&gt;, or &lt;a href="https://kiro.dev/" target="_blank" rel="noopener"&gt;Kiro&lt;/a&gt;. There are no clusters to provision, no code to repackage, and no deploy-and-check loop. Your local Python session can stay local as usual while Spark operations are automatically routed to a remote Spark server for execution.&lt;/p&gt; 
&lt;p&gt;Each Spark Connect session has its own AWS resource with a unique ARN, enabling per‑session &lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener"&gt;AWS Identity and Access Management (AWS IAM)&lt;/a&gt; permissions, tag‑based cost allocation, audit through &lt;a href="https://docs.aws.amazon.com/awscloudtrail/latest/userguide/cloudtrail-user-guide.html" target="_blank" rel="noopener"&gt;AWS CloudTrail&lt;/a&gt;, and session-specific configuration overrides. This gives teams finer control over who runs what, at what cost. You also get real-time visibility through the &lt;a href="https://spark.apache.org/docs/latest/web-ui.html" target="_blank" rel="noopener"&gt;Spark UI&lt;/a&gt;, persistent session history, and a dedicated interface to monitor and manage active and completed sessions.&lt;/p&gt; 
&lt;p&gt;For more details, visit the &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/release-version-7130.html" target="_blank" rel="noopener"&gt;EMR Serverless release notes&lt;/a&gt; or the &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/spark-connect.html" target="_blank" rel="noopener"&gt;EMR Serverless Developer Guide&lt;/a&gt;. For a quick look at the experience, here’s a demonstration of using Spark Connect in Amazon SageMaker Unified Studio Data Notebooks:&lt;/p&gt; 
&lt;div style="width: 640px;" class="wp-video"&gt;
 &lt;video class="wp-video-shortcode" id="video-91485-1" width="640" height="360" preload="metadata" controls="controls"&gt;
  &lt;source type="video/mp4" src="https://d2908q01vomqb2.cloudfront.net/artifacts/DBSBlogs/BDB-5893/bdb5893media2.mp4?_=1"&gt;
 &lt;/video&gt;
&lt;/div&gt; 
&lt;p&gt;For a runnable end-to-end example, try the &lt;a href="https://github.com/aws-samples/emr-serverless-samples/blob/main/examples/spark-connect/spark-connect-interactive.ipynb" target="_blank" rel="noopener"&gt;EMR Serverless Spark Connect sample notebook&lt;/a&gt; from your local &lt;a href="https://en.wikipedia.org/wiki/Integrated_development_environment" target="_blank" rel="noopener"&gt;IDE&lt;/a&gt;. See the following demonstration:&lt;/p&gt; 
&lt;div style="width: 640px;" class="wp-video"&gt;
 &lt;video class="wp-video-shortcode" id="video-91485-2" width="640" height="360" preload="metadata" controls="controls"&gt;
  &lt;source type="video/mp4" src="https://d2908q01vomqb2.cloudfront.net/artifacts/DBSBlogs/BDB-5893/bdb5893media1.mp4?_=2"&gt;
 &lt;/video&gt;
&lt;/div&gt; 
&lt;h2&gt;How Spark Connect works&lt;/h2&gt; 
&lt;p&gt;Spark Connect uses a client-server architecture that separates application code from the Spark engine. The client, a lightweight &lt;a href="https://spark.apache.org/docs/latest/api/python/index.html" target="_blank" rel="noopener"&gt;PySpark&lt;/a&gt; library running in a local environment, sends Spark operations over a secure &lt;a href="https://grpc.io/docs/guides/auth/" target="_blank" rel="noopener"&gt;gRPC/TLS&lt;/a&gt; connection to a Spark Connect server running on EMR Serverless. The server runs those operations on EMR Serverless compute and returns the results to your local session.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/21/BDB-5893-2.png" alt="Spark Connect client-server architecture showing a local IDE connecting to a Spark Connect server on EMR Serverless" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;Your local machine doesn’t need Spark installed, doesn’t need direct access to the data, and doesn’t need to be sized for the workload. Because the client is a compact library, you can embed Spark operations in any Python application that can import PySpark, including web services, dashboards, and automation scripts. For example, a development team can add Spark-powered analytics directly into a &lt;a href="https://fastapi.tiangolo.com/" target="_blank" rel="noopener"&gt;FastAPI&lt;/a&gt; backend or a &lt;a href="https://streamlit.io/" target="_blank" rel="noopener"&gt;Streamlit&lt;/a&gt; dashboard, treating Spark like a database driver rather than a separate batch system. Because the compute-intensive processing happens on the EMR Serverless side, these capabilities extend Spark Connect beyond traditional notebook and IDE development. You can also use pandas, matplotlib, and your team’s internal Python libraries on your laptop or in your embedded clients without installing them on EMR Serverless.&lt;/p&gt; 
&lt;p&gt;With Spark Connect server sessions running on EMR Serverless, you pay for compute only while your session is active. EMR Serverless automatically scales compute up and down based on workload demands through &lt;a href="https://spark.apache.org/docs/latest/job-scheduling.html#dynamic-resource-allocation" target="_blank" rel="noopener"&gt;dynamic resource allocation (DRA)&lt;/a&gt;, eliminating the need to predict capacity ahead of time. Teams that run Spark Connect sessions regularly can configure &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/pre-init-capacity.html" target="_blank" rel="noopener"&gt;pre-initialized capacity&lt;/a&gt; on the EMR Serverless application for faster session startup times. Additionally, your Spark Connect sessions have access to the full suite of EMR Serverless features, including &lt;a href="https://aws.amazon.com/ec2/graviton/?refid=d7d7f50f-1d0a-4baa-baef-d0a7c28c6909" target="_blank" rel="noopener"&gt;AWS Graviton processors&lt;/a&gt; for cost optimization and &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/vpc-access.html" target="_blank" rel="noopener"&gt;secure VPC connectivity&lt;/a&gt; to your data sources. You can also use &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/using-custom-images.html" target="_blank" rel="noopener"&gt;custom images&lt;/a&gt; for added flexibility and get integrated observability through &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/app-job-metrics.html" target="_blank" rel="noopener"&gt;Amazon CloudWatch&lt;/a&gt; and the Spark UI.&lt;/p&gt; 
&lt;h2&gt;Getting started&lt;/h2&gt; 
&lt;p&gt;Getting started with Spark Connect on EMR Serverless takes three steps: create an application, start a session, and connect from your IDE.&lt;/p&gt; 
&lt;p&gt;Note: The resources created in this quick start incur charges while active. Make sure to follow the cleanup steps at the end of this tutorial to avoid ongoing charges.&lt;/p&gt; 
&lt;h3&gt;Prerequisites&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;In addition to the &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/security-iam-user-access-policies.html" target="_blank" rel="noopener"&gt;required job runtime IAM role&lt;/a&gt;, you need these permissions: &lt;code&gt;emr-serverless:StartSession&lt;/code&gt;, &lt;code&gt;emr-serverless:GetSession&lt;/code&gt;, &lt;code&gt;emr-serverless:GetSessionEndpoint&lt;/code&gt;, &lt;code&gt;emr-serverless:TerminateSession&lt;/code&gt;, &lt;code&gt;emr-serverless:GetResourceDashboard&lt;/code&gt;, and &lt;code&gt;iam:PassRole&lt;/code&gt; on the runtime role.&lt;/li&gt; 
 &lt;li&gt;An existing EMR Serverless application running emr-7.13.0 or later, with &lt;code&gt;interactiveConfiguration.sessionEnabled = true&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;boto3 version 1.43.0 or later to access the latest EMR Serverless session APIs.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Step 1: Create an EMR Serverless application with Spark Connect enabled&lt;/h3&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/21/BDB-5893-3.png" alt="Amazon EMR Serverless application creation page in the EMR console" width="600"&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Open the &lt;strong&gt;Amazon EMR console&lt;/strong&gt; and navigate to EMR Serverless.&lt;/li&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Get started&lt;/strong&gt;. A pop-up appears. Choose &lt;strong&gt;Create and launch EMR Studio&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;This takes you to the &lt;strong&gt;Create application&lt;/strong&gt; page.&lt;/li&gt; 
 &lt;li&gt;Enter a &lt;strong&gt;Name&lt;/strong&gt; for your application (for example, spark-connect-app).&lt;/li&gt; 
 &lt;li&gt;For &lt;strong&gt;Type&lt;/strong&gt;, select &lt;strong&gt;Spark&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;For &lt;strong&gt;Release version&lt;/strong&gt;, select &lt;strong&gt;emr-7.13.0&lt;/strong&gt; or later.&lt;/li&gt; 
 &lt;li&gt;For &lt;strong&gt;Architecture&lt;/strong&gt;, choose &lt;strong&gt;x86_64&lt;/strong&gt; (default). This is compatible with most third-party tools and libraries.&lt;/li&gt; 
 &lt;li&gt;Under &lt;strong&gt;Application setup options&lt;/strong&gt;, select &lt;strong&gt;Use default settings for interactive workloads&lt;/strong&gt;. This automatically sets &lt;code&gt;interactiveConfiguration.sessionEnabled = true&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Create and start application&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Alternatively, use the AWS CLI:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# Create an application with Spark Connect enabled
APP_ID=$(aws emr-serverless create-application \
  --type "SPARK" \
  --name "spark-connect-app" \
  --release-label emr-7.13.0 \
  --interactive-configuration '{"sessionEnabled": true}' \
  --query 'applicationId' \
  --output text)
echo "Created application: $APP_ID"
# Start the application
aws emr-serverless start-application --application-id "$APP_ID"&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3&gt;Step 2: Start a session&lt;/h3&gt; 
&lt;p&gt;Next, start a session and obtain the Spark Connect endpoint.&lt;/p&gt; 
&lt;p&gt;Provide an &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/spark-connect.html#spark-connect-permissions" target="_blank" rel="noopener"&gt;IAM execution role&lt;/a&gt; that grants the session access to your data, such as reading data from an &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener"&gt;Amazon S3&lt;/a&gt; bucket or querying the &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/serverless-etl-aws-glue/aws-glue-data-catalog.html" target="_blank" rel="noopener"&gt;AWS Glue Data Catalog&lt;/a&gt;. This is the same type of role used for EMR Serverless batch jobs.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# Start a session with your execution role
ROLE_ARN="YOUR_ROLE" # example: arn:aws:iam::123456789012:role/EMRServerlessSessionRole
SESSION_ID=$(aws emr-serverless start-session \
  --application-id "$APP_ID" \
  --execution-role-arn "$ROLE_ARN" \
  --query sessionId \
  --output text)

# Get the session endpoint
aws emr-serverless get-session-endpoint \
  --application-id "$APP_ID" \
  --session-id "$SESSION_ID"&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The &lt;code&gt;get-session-endpoint&lt;/code&gt; response includes a secure endpoint URL and an authentication token. All communication between your local environment and EMR Serverless is encrypted using &lt;a href="https://en.wikipedia.org/wiki/Transport_Layer_Security" target="_blank" rel="noopener"&gt;TLS&lt;/a&gt;. Treat the token as a sensitive credential. Consider using &lt;a href="https://aws.amazon.com/secrets-manager/" target="_blank" rel="noopener"&gt;AWS Secrets Manager&lt;/a&gt; to store and retrieve tokens programmatically. The authentication token is time-limited to 1 hour, so for long-running sessions we recommend that you refresh it periodically.&lt;/p&gt; 
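&lt;p&gt;One way to handle the 1-hour token lifetime is a small cache that requests a new token shortly before the old one expires. The following is a minimal sketch under stated assumptions: &lt;code&gt;SessionTokenCache&lt;/code&gt; is a hypothetical class, &lt;code&gt;fetch_token&lt;/code&gt; stands in for whatever code calls &lt;code&gt;get-session-endpoint&lt;/code&gt; and returns the token, and the 5-minute refresh margin is an arbitrary choice. The demo uses a fake clock and fake tokens so that it runs without AWS access.&lt;/p&gt; 

```python
import time


class SessionTokenCache:
    """Cache an auth token and fetch a new one shortly before it expires."""

    def __init__(self, fetch_token, lifetime_seconds=3600,
                 refresh_margin_seconds=300, clock=time.monotonic):
        self._fetch_token = fetch_token      # callable returning a token string
        self._lifetime = lifetime_seconds    # 1 hour, per the text above
        self._margin = refresh_margin_seconds
        self._clock = clock
        self._token = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        stale = (
            self._token is None
            or now - self._fetched_at >= self._lifetime - self._margin
        )
        if stale:
            self._token = self._fetch_token()
            self._fetched_at = now
        return self._token


# Demo with a fake clock and a fake fetcher (no AWS calls are made).
fake_now = [0.0]
issued = []


def fake_fetch():
    issued.append("token-" + str(len(issued) + 1))
    return issued[-1]


cache = SessionTokenCache(fake_fetch, clock=lambda: fake_now[0])
first = cache.get()       # nothing cached yet, so a token is fetched
fake_now[0] = 1800.0
second = cache.get()      # 30 minutes in, the cached token is reused
fake_now[0] = 3400.0
third = cache.get()       # inside the refresh margin, a new token is fetched
print(first, second, third)
```

&lt;p&gt;Injecting the clock and the fetch function keeps the cache testable and independent of any particular SDK call.&lt;/p&gt; 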
&lt;h3&gt;Step 3: Connect from your local IDE&lt;/h3&gt; 
&lt;p&gt;Use the returned endpoint URL and authentication token to connect to the Spark Connect server.&lt;/p&gt; 
&lt;p&gt;The connection URL uses the &lt;code&gt;sc://&lt;/code&gt; protocol, which is the Spark Connect standard. The &lt;code&gt;use_ssl=true&lt;/code&gt; parameter enables encrypted communication over TLS, so your data and credentials are protected in transit.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;from pyspark.sql import SparkSession

# Use the endpoint and auth token from get-session-endpoint
session_endpoint="&amp;lt;endpoint-from-get-session-endpoint&amp;gt;"
auth_token="&amp;lt;authToken-from-get-session-endpoint&amp;gt;"

spark_connect_url = (
    f"sc://{session_endpoint}:443/;use_ssl=true;x-aws-proxy-auth={auth_token}"
)

spark = SparkSession.builder \
    .remote(spark_connect_url) \
    .getOrCreate()

# Query data in your S3 data lake
df = spark.sql("SELECT * FROM my_catalog.my_database.my_table")
df.show()

# Run transformations at scale
df.groupBy("category").count().orderBy("count", ascending=False).show()
spark.stop()&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
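&lt;p&gt;Because the token is a sensitive credential, you might prefer not to paste it into source files. The following is a minimal sketch of one way to do that: it reads the endpoint and token from environment variables and assembles the same URL format shown above. The variable names &lt;code&gt;EMR_SESSION_ENDPOINT&lt;/code&gt; and &lt;code&gt;EMR_SESSION_TOKEN&lt;/code&gt; and both helper functions are illustrative assumptions, not part of EMR Serverless or PySpark.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;import os


def build_spark_connect_url(endpoint, auth_token, port=443):
    """Assemble the Spark Connect URL in the same format as the example above."""
    if not endpoint or not auth_token:
        raise ValueError("Both the session endpoint and the auth token are required")
    return f"sc://{endpoint}:{port}/;use_ssl=true;x-aws-proxy-auth={auth_token}"


def url_from_environment(env=os.environ):
    """Read the endpoint and token from (hypothetical) environment variables."""
    return build_spark_connect_url(
        env.get("EMR_SESSION_ENDPOINT", ""),
        env.get("EMR_SESSION_TOKEN", ""),
    )


if __name__ == "__main__":
    # Placeholder values for illustration only
    demo_env = {
        "EMR_SESSION_ENDPOINT": "example-endpoint.invalid",
        "EMR_SESSION_TOKEN": "example-token",
    }
    print(url_from_environment(demo_env))&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;You can pass the returned string to &lt;code&gt;SparkSession.builder.remote()&lt;/code&gt; exactly as in the previous example.&lt;/p&gt; 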
&lt;p&gt;Once connected, Spark operations you write in your IDE can be run on EMR Serverless. For debugging, you can pause the execution at &lt;a href="https://en.wikipedia.org/wiki/Breakpoint#Types_of_breakpoints" target="_blank" rel="noopener"&gt;breakpoints&lt;/a&gt;, &lt;a href="https://en.wikipedia.org/wiki/Variable_(high-level_programming_language)" target="_blank" rel="noopener"&gt;inspect variables&lt;/a&gt;, and step through your transformations locally while EMR Serverless processes your data on remote, scalable infrastructure.&lt;/p&gt; 
&lt;p&gt;Sessions remain active for a configurable idle timeout (1 hour by default). If your connection drops, the session continues running, allowing you to reconnect without losing your work. You can also access the live Spark UI through the &lt;a href="https://docs.aws.amazon.com/emr-serverless/latest/APIReference/API_GetResourceDashboard.html" target="_blank" rel="noopener"&gt;GetResourceDashboard&lt;/a&gt; API to monitor queries, stages, and executors in real time. After the session ends, the Spark History Server remains available for post-run analysis.&lt;/p&gt; 
&lt;h3&gt;Clean up resources&lt;/h3&gt; 
&lt;p&gt;If the 1-hour session idle timeout does not meet your needs, you can terminate sessions manually to avoid ongoing costs. Note that terminating an active session immediately stops your running Spark operations. Before you do that, verify that all critical data processing has completed and that results are saved.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# 1. Stop the active session
aws emr-serverless terminate-session \
  --application-id $APP_ID \
  --session-id $SESSION_ID

# 2. Stop the application
aws emr-serverless stop-application --application-id $APP_ID&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h2&gt;Use cases&lt;/h2&gt; 
&lt;p&gt;Spark Connect on EMR Serverless supports a wide range of development workflows. The following are some of the most common use cases:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Interactive ETL development&lt;/strong&gt; — Build and test data pipelines interactively, validating transformations against full-scale datasets before promoting them to production as batch jobs.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;SageMaker Unified Studio (SMUS)&lt;/strong&gt; &lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/notebooks.html" target="_blank" rel="noopener"&gt;Data Notebooks&lt;/a&gt; — Run interactive PySpark sessions directly from SMUS Data Notebooks connected to EMR Serverless through Spark Connect.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Direct S3 and &lt;a href="https://en.wikipedia.org/wiki/Java_Database_Connectivity" target="_blank" rel="noopener"&gt;JDBC&lt;/a&gt; access without a catalog&lt;/strong&gt; — Connect directly to S3 files and JDBC data sources without needing a metastore or catalog configuration.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;&lt;a href="https://iceberg.apache.org/" target="_blank" rel="noopener"&gt;Apache Iceberg&lt;/a&gt; Data Lakehouse analytics&lt;/strong&gt; — Query and manage Iceberg tables through the AWS Glue Data Catalog, with full support for &lt;a href="https://iceberg.apache.org/docs/latest/spark-queries/#time-travel-queries-with-sql" target="_blank" rel="noopener"&gt;time travel,&lt;/a&gt; &lt;a href="https://iceberg.apache.org/docs/latest/evolution/#schema-evolution" target="_blank" rel="noopener"&gt;schema evolution,&lt;/a&gt; and &lt;a href="https://iceberg.apache.org/docs/latest/partitioning/" target="_blank" rel="noopener"&gt;partition management&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;&lt;a href="https://aws.amazon.com/s3/features/tables/" target="_blank" rel="noopener"&gt;Amazon S3 Tables&lt;/a&gt; with federated catalog&lt;/strong&gt; — Access S3 Tables as a federated Glue Data Catalog source, combining Iceberg features with serverless Spark execution.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;&lt;a href="https://docs.getdbt.com/docs/local/connect-data-platform/spark-setup?version=1.13" target="_blank" rel="noopener"&gt;dbt-spark&lt;/a&gt;&lt;/strong&gt; — Run dbt-spark adapter against EMR Serverless via Spark Connect, allowing analytics engineers to develop and test transformations locally with &lt;a href="https://docs.getdbt.com/docs/introduction" target="_blank" rel="noopener"&gt;dbt&lt;/a&gt; framework while using EMR Serverless as the remote Spark engine.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Exploratory data analysis and &lt;a href="https://aws.amazon.com/what-is/feature-engineering/" target="_blank" rel="noopener"&gt;feature engineering&lt;/a&gt;&lt;/strong&gt; — Analyze production-scale data from your preferred notebook environment instead of using sampled subsets, helping you catch data quality issues earlier.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Compute standardization&lt;/strong&gt; — Standardize on EMR Serverless as the Spark backend while keeping the flexibility to use your preferred local tools, version control, and &lt;a href="https://docs.aws.amazon.com/whitepapers/latest/cicd_for_5g_networks_on_aws/cicd-on-aws.html" target="_blank" rel="noopener"&gt;CI/CD&lt;/a&gt; workflows.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;These use cases work across multiple client surfaces: IDEs, Jupyter notebooks, dbt-spark, and AI coding agents. Because Spark Connect is an open Apache Spark standard, the same PySpark code typically works across different Spark backends by changing the connection endpoint.&lt;/p&gt; 
&lt;h2&gt;Availability and pricing&lt;/h2&gt; 
&lt;p&gt;Spark Connect on EMR Serverless is now available with Apache Spark 3.5.6 on Amazon EMR release 7.13 and higher in all &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/considerations.html" target="_blank" rel="noopener"&gt;AWS Regions&lt;/a&gt; where EMR Serverless is available. There is no additional charge for using Spark Connect. You pay for the EMR Serverless compute resources (&lt;a href="https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/instance-optimize-cpu.html" target="_blank" rel="noopener"&gt;vCPU&lt;/a&gt;, memory, and storage) consumed during your session, the same pricing model as EMR Serverless batch jobs.&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;Spark Connect on EMR Serverless bridges the gap between local development and production-scale execution. Build and debug PySpark applications from your preferred environment (IDE, notebook, dbt, or AI coding agent) while EMR Serverless handles automatic scaling, per-session cost visibility, and infrastructure management behind the scenes. With ARN-addressable sessions, fine-grained IAM permissions, tag-based cost allocation, and per-session configuration overrides, your team gets the controls they need without sacrificing flexibility.&lt;/p&gt; 
&lt;p&gt;Get started today with EMR release 7.13.0 (Spark 3.5.6). Follow the step-by-step tutorial in the &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/spark-connect.html" target="_blank" rel="noopener"&gt;EMR Serverless Developer Guide&lt;/a&gt; to create your first Spark Connect session and experience interactive, serverless PySpark development firsthand.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2020/12/23/Al-MS.jpg" alt="Al MS" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Al MS&lt;/h3&gt; 
  &lt;p&gt;Al is a product manager for Amazon EMR at AWS.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2023/12/21/Melody-Yang-100px.jpg" alt="Melody Yang" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Melody Yang&lt;/h3&gt; 
  &lt;p&gt;Melody Yang is a Principal Analytics Specialist Solution Architect at AWS with expertise in Big Data technologies. She is an experienced analytics leader working with AWS customers to provide best practice guidance and technical advice in order to assist their success in data transformation. Her areas of interests are open-source frameworks and automation, data engineering and DataOps.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/21/bdb5893a3.jpg" alt="KiKi Nwangwu" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;KiKi Nwangwu&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/nkeirunwangwuokafor/" target="_blank" rel="noopener"&gt;Kiki&lt;/a&gt; is an Analytics and GenAI Specialist Solutions Architect at AWS. She specializes in helping customers architect, build, and modernize scalable data analytics and generative AI solutions. She enjoys traveling and exploring new cultures.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		<enclosure length="1495074" type="video/mp4" url="https://d2908q01vomqb2.cloudfront.net/artifacts/DBSBlogs/BDB-5893/bdb5893media1.mp4"/>
<enclosure length="2615863" type="video/mp4" url="https://d2908q01vomqb2.cloudfront.net/artifacts/DBSBlogs/BDB-5893/bdb5893media2.mp4"/>

			</item>
		<item>
		<title>Build stateful streaming applications with Apache Spark 4.0 on Amazon EMR Serverless</title>
		<link>https://aws.amazon.com/blogs/big-data/build-stateful-streaming-applications-with-apache-spark-4-0-on-amazon-emr-serverless/</link>
		
		<dc:creator><![CDATA[Raj Ramasubbu]]></dc:creator>
		<pubDate>Tue, 09 Jun 2026 16:14:54 +0000</pubDate>
				<category><![CDATA[Amazon EMR]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Serverless]]></category>
		<guid isPermaLink="false">023a34a49a0b816f48644e35b9a4d0110ed89c13</guid>

					<description>In this post, we demonstrate how to build a production-ready IoT device monitoring system using Spark 4.0’s transformWithState API on Amazon EMR Serverless. This example showcases the key capabilities of stateful streaming and provides a template you can adapt for your own use cases.</description>
										<content:encoded>&lt;p&gt;&lt;a href="https://spark.apache.org/releases/spark-release-4-0-0.html" target="_blank" rel="noopener"&gt;Apache Spark 4.0&lt;/a&gt; represents a major milestone in stream processing, introducing new capabilities that fundamentally change how developers build stateful streaming applications. At the heart of these improvements is the &lt;a href="https://spark.apache.org/docs/latest/streaming/structured-streaming-transform-with-state.html" target="_blank" rel="noopener"&gt;&lt;code&gt;transformWithState&lt;/code&gt;&lt;/a&gt; API – a new capability that brings first-class support for timers, automatic state management, and schema evolution to &lt;a href="https://spark.apache.org/docs/latest/streaming/index.html" target="_blank" rel="noopener"&gt;Spark Structured Streaming&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;With Spark 4.0 now available on &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/release-version-emr-spark-8.0-preview.html" target="_blank" rel="noopener"&gt;Amazon EMR Serverless&lt;/a&gt;, developers can build stateful streaming applications using the &lt;code&gt;transformWithState&lt;/code&gt; API in a fully managed, serverless environment that automatically scales based on workload demands. This combination delivers the power of sophisticated stream processing without the operational overhead of cluster management.&lt;/p&gt; 
&lt;p&gt;In this post, we demonstrate how to build a production-ready IoT device monitoring system using Spark 4.0’s &lt;code&gt;transformWithState&lt;/code&gt; API on Amazon EMR Serverless. This example showcases the key capabilities of stateful streaming and provides a template you can adapt for your own use cases.&lt;/p&gt; 
&lt;h2&gt;Apache Spark 4.0: introducing transformWithState&lt;/h2&gt; 
&lt;p&gt;&lt;a href="https://spark.apache.org/releases/spark-release-4-0-1.html" target="_blank" rel="noopener"&gt;Apache Spark 4.0&lt;/a&gt;’s latest streaming features solve common production challenges in stateful applications by introducing native timer support and advance state management capabilities for complex event processing workflows. The new &lt;code&gt;transformWithState&lt;/code&gt; API provides:&lt;/p&gt; 
&lt;h3&gt;Key features of transformWithState&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Native timer support:&lt;/strong&gt; Register timers that fire callbacks at specific times for use cases like heartbeat monitoring, session timeout detection, and SLA violation alerts.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Automatic state TTL (Time-To-Live):&lt;/strong&gt; Configure automatic expiration policies to prevent state from growing indefinitely. This is useful for use cases like session state size control, clearing stale device telemetry, maintaining a recency cache, or tracking invalid logins within the last hour for fraud detection.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Schema evolution:&lt;/strong&gt; Evolve state schema without restarting from a new checkpoint. Add optional fields, remove fields, or widen numeric types. This is particularly valuable for use cases where data structures are dynamic, and application downtime for schema migration is not acceptable, enabling more resilient and flexible real-time streaming applications.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Multiple state variables:&lt;/strong&gt; Support for multiple independent state variables (&lt;code&gt;ValueState&lt;/code&gt;, &lt;code&gt;ListState&lt;/code&gt;, &lt;code&gt;MapState&lt;/code&gt;) per key, well-suited for building complex, real-time applications that require sophisticated state management, such as storing a history of recent error codes, tracking counts of various alert types, or maintaining multiple dimensions of user activity within a single stateful operator.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;State observability:&lt;/strong&gt; Query application state mid-stream using the State Data Source Reader for debugging and monitoring. This is especially valuable in applications that require maintaining and evolving state through several steps, such as detection of sophisticated event patterns across multiple streams and over time, where visibility into state transitions is critical for troubleshooting and validation.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Operator chaining:&lt;/strong&gt; Chain multiple stateful operators together for complex multi-stage processing pipelines.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;These capabilities make Spark 4.0 ideal for applications that were previously difficult or impossible to implement efficiently, such as complex event processing, session analytics, anomaly detection, and real-time monitoring systems.&lt;/p&gt; 
&lt;h2&gt;Use case: IoT heartbeat monitoring&lt;/h2&gt; 
&lt;p&gt;Consider a fleet of 100,000 IoT sensors deployed across manufacturing facilities. Each sensor sends a heartbeat signal every 20 seconds to indicate it’s operational. Your operations team needs to be alerted within 30 seconds if any sensor goes offline, with repeat alerts every 60 seconds until the sensor comes back online.&lt;/p&gt; 
&lt;p&gt;This seemingly simple requirement presents several technical challenges. The application must maintain the last heartbeat timestamp for each of the 100,000 devices while independently managing timers to detect missed signals per device. It also needs to handle out-of-order heartbeats caused by network delays and clean up state for decommissioned devices to prevent unbounded memory growth. All of this must happen at scale, processing millions of events per minute with low latency, while recovering gracefully from failures without losing state.&lt;/p&gt; 
&lt;p&gt;To address the specific challenges of IoT heartbeat monitoring described above, we present a solution built on the &lt;a href="https://spark.apache.org/docs/latest/streaming/structured-streaming-transform-with-state.html" target="_blank" rel="noopener"&gt;&lt;code&gt;transformWithState&lt;/code&gt; API&lt;/a&gt; in Spark 4.0. Its native timer support, automatic state management, and built-in fault tolerance make it well suited to IoT heartbeat monitoring at scale.&lt;/p&gt; 
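&lt;p&gt;The timing requirement can be made concrete with a little arithmetic. The following sketch is plain Python and independent of Spark. It lists when alerts would fire for a device whose last heartbeat arrived at a given time, assuming an idealized clock with no micro-batch or network delay. The function and constant names are illustrative.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;OFFLINE_DEADLINE_S = 30   # first alert 30 seconds after the last heartbeat
REPEAT_INTERVAL_S = 60    # repeat alerts every 60 seconds while offline


def alert_times(last_heartbeat_s, now_s):
    """Return the times (in seconds) at which alerts fire, up to and including now_s."""
    first_alert = last_heartbeat_s + OFFLINE_DEADLINE_S
    return list(range(first_alert, now_s + 1, REPEAT_INTERVAL_S))


if __name__ == "__main__":
    # Last heartbeat at t=100 s, device still silent at t=250 s
    print(alert_times(100, 250))  # [130, 190, 250]
    # The next heartbeat is due at t=120 s, before the deadline at t=130 s
    print(alert_times(100, 120))  # []&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;With heartbeats every 20 seconds and a 30-second deadline, a heartbeat can arrive roughly 10 seconds late before the device is reported offline.&lt;/p&gt; 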
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;Our solution architecture follows a serverless, event-driven design:&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/12/BDB-5849-1.png" alt="Solution architecture showing IoT devices sending heartbeats to Kinesis Data Streams, processed by EMR Serverless with transformWithState, checkpointed to Amazon S3, and alerts delivered via Amazon SNS" width="600"&gt;&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;IoT devices send heartbeat events to &lt;strong&gt;Amazon Kinesis Data Streams&lt;/strong&gt; containing device ID, timestamp, and metadata (battery level, signal strength, firmware version).&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon EMR Serverless&lt;/strong&gt; reads from Kinesis through a &lt;strong&gt;VPC endpoint for Kinesis&lt;/strong&gt; using the Spark aws-kinesis connector, then parses the JSON events into structured DataFrames and groups them by &lt;code&gt;device_id&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;transformWithState&lt;/code&gt; processes each device’s stream. On heartbeat arrival, it updates state and registers a 30-second timer; when the timer expires without a new heartbeat, it emits an offline alert.&lt;/li&gt; 
 &lt;li&gt;State is automatically persisted to &lt;a href="https://rocksdb.org/" target="_blank" rel="noopener"&gt;RocksDB&lt;/a&gt; locally and checkpointed to &lt;strong&gt;Amazon Simple Storage Service (Amazon S3)&lt;/strong&gt;, enabling fault-tolerant recovery and exactly-once processing semantics.&lt;/li&gt; 
 &lt;li&gt;Alerts are delivered via &lt;strong&gt;Amazon Simple Notification Service (Amazon SNS)&lt;/strong&gt; to configured subscribers (email, SMS, AWS Lambda, webhooks).&lt;/li&gt; 
&lt;/ol&gt; 
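&lt;p&gt;For reference, a heartbeat event as described in step 1 could be produced as follows. This is a sketch only: the field names match the columns parsed later in this post, but the sample values, the ISO 8601 timestamp format, and the helper name &lt;code&gt;make_heartbeat&lt;/code&gt; are assumptions.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;import json
from datetime import datetime, timezone


def make_heartbeat(device_id, battery_level, signal_strength, firmware_version, when=None):
    """Serialize one heartbeat event to UTF-8 JSON bytes, ready to send as a stream record."""
    when = when or datetime.now(timezone.utc)
    event = {
        "device_id": device_id,
        "timestamp": when.isoformat(),
        "battery_level": battery_level,
        "signal_strength": signal_strength,
        "firmware_version": firmware_version,
    }
    return json.dumps(event).encode("utf-8")


if __name__ == "__main__":
    # Hypothetical device and readings for illustration
    sample_time = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    payload = make_heartbeat("sensor-0001", "87", "-61", "1.4.2", sample_time)
    print(payload.decode("utf-8"))&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;If you send these records to Kinesis yourself, using &lt;code&gt;device_id&lt;/code&gt; as the partition key is one way to keep each device’s events ordered within a shard.&lt;/p&gt; 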
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;Before implementing this solution, verify that you have:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/setting-up.html" target="_blank" rel="noopener"&gt;&lt;strong&gt;AWS account:&lt;/strong&gt;&lt;/a&gt; With permissions for EMR Serverless, Kinesis, SNS, S3, VPC, and IAM.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html" target="_blank" rel="noopener"&gt;&lt;strong&gt;AWS Command Line Interface (AWS CLI):&lt;/strong&gt;&lt;/a&gt; Configured with appropriate credentials.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/vpc-example-private-subnets-nat.html" target="_blank" rel="noopener"&gt;VPC setup&lt;/a&gt;:&lt;/strong&gt; VPC with private subnets and security groups configured.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/vpc/latest/privatelink/interface-endpoints.html" target="_blank" rel="noopener"&gt;&lt;strong&gt;Kinesis VPC interface endpoint:&lt;/strong&gt;&lt;/a&gt; VPC endpoint for private connectivity to Kinesis.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/streams/latest/dev/how-do-i-create-a-stream.html" target="_blank" rel="noopener"&gt;&lt;strong&gt;Kinesis Data Stream:&lt;/strong&gt;&lt;/a&gt; Created for ingesting heartbeat events (for example, &lt;code&gt;iot-heartbeats&lt;/code&gt;). For testing your streaming data solution, refer to &lt;a href="https://aws.amazon.com/blogs/big-data/test-your-streaming-data-solution-with-the-new-amazon-kinesis-data-generator/" target="_blank" rel="noopener"&gt;Test your streaming data solution with the new Amazon Kinesis Data Generator&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/sns/latest/dg/sns-getting-started.html" target="_blank" rel="noopener"&gt;&lt;strong&gt;SNS topic:&lt;/strong&gt;&lt;/a&gt; Created for sending alerts (for example, &lt;code&gt;iot-alerts&lt;/code&gt;).&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/GetStartedWithS3.html#creating-bucket" target="_blank" rel="noopener"&gt;S3 bucket&lt;/a&gt;:&lt;/strong&gt; For storing application code, dependencies, and checkpoints.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Step-by-step implementation&lt;/h2&gt; 
&lt;p&gt;The following steps walk you through setting up an EMR Serverless application with Spark 4.0, configuring the stateful streaming processor, and deploying the IoT heartbeat monitoring solution.&lt;/p&gt; 
&lt;h3&gt;Step 1: Create the EMR Serverless application&lt;/h3&gt; 
&lt;p&gt;Run the following command in your terminal using the AWS CLI. Replace the subnet and security group IDs with the values from your VPC setup.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# Create EMR Serverless application with Spark 4.0 and VPC
aws emr-serverless create-application \
  --name "iot-heartbeat-monitor" \
  --release-label "emr-spark-8.0.0" \
  --type "SPARK" \
  --network-configuration '{
    "subnetIds": ["subnet-xxxxx", "subnet-yyyyy"],
    "securityGroupIds": ["sg-zzzzz"]
  }' \
  --region us-east-1&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The command returns a JSON response containing the application details. Note the &lt;code&gt;applicationId&lt;/code&gt; value from the output, as you will need it in subsequent steps.&lt;/p&gt; 
&lt;h3&gt;Step 2: Implement the heartbeat monitor&lt;/h3&gt; 
&lt;p&gt;The core of our solution is the &lt;code&gt;HeartbeatMonitor&lt;/code&gt; class, which extends &lt;code&gt;StatefulProcessor&lt;/code&gt;. This class demonstrates the key features of Spark 4.0’s &lt;code&gt;transformWithState&lt;/code&gt; API. Download the full &lt;a href="https://github.com/aws-samples/sample-amazon-emr-spark4-examples/blob/main/pyspark/AWSSpark4TransformWithState/heartbeat_monitor_final.py" target="_blank" rel="noopener"&gt;implementation script&lt;/a&gt; and upload it to your S3 bucket for execution. Let’s walk through each component to understand how it works.&lt;/p&gt; 
&lt;h4&gt;2.1 Initialize state variables&lt;/h4&gt; 
&lt;p&gt;The &lt;code&gt;init()&lt;/code&gt; method is called once when the processor is initialized. This is where we define and register our state variables.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;from pyspark.sql.streaming.stateful_processor import (
    StatefulProcessor, StatefulProcessorHandle
)

class HeartbeatMonitor(StatefulProcessor):

    def init(self, handle: StatefulProcessorHandle) -&amp;gt; None:
        self.handle = handle

        # Define state schemas
        last_seen_schema = StructType([
            StructField("timestamp", TimestampType(), True)
        ])

        device_info_schema = StructType([
            StructField("battery_level", StringType(), True),
            StructField("firmware_version", StringType(), True)
        ])

        # Initialize multiple independent state variables
        self.last_seen = handle.getValueState("last_seen", last_seen_schema)
        self.device_info = handle.getValueState(
            "device_info", device_info_schema
        )&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;In the &lt;code&gt;init()&lt;/code&gt; method, we use &lt;code&gt;StatefulProcessorHandle&lt;/code&gt; to define and initialize two per-key state variables, &lt;code&gt;last_seen&lt;/code&gt; and &lt;code&gt;device_info&lt;/code&gt;, using Spark’s &lt;code&gt;StructType&lt;/code&gt; schemas and the &lt;code&gt;getValueState()&lt;/code&gt; API. These state variables are automatically stored in &lt;a href="https://aws.amazon.com/blogs/big-data/rocksdb-101-optimizing-stateful-streaming-in-apache-spark-with-amazon-emr-and-aws-glue/" target="_blank" rel="noopener"&gt;RocksDB&lt;/a&gt; and checkpointed to S3, allowing for fault-tolerant state management across streaming micro-batches.&lt;/p&gt; 
&lt;h4&gt;2.2 Handle incoming heartbeat events and register timers&lt;/h4&gt; 
&lt;p&gt;The &lt;code&gt;handleInputRows()&lt;/code&gt; method is called whenever new events arrive for a device. This is where we update state and register timers.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;def handleInputRows(
    self, key: tuple, rows: Iterator[pd.DataFrame], timerValues
) -&amp;gt; Iterator[pd.DataFrame]:
    device_id = key[0]

    # Process incoming heartbeats - iterate through all rows to find latest
    latest_timestamp = None
    for pdf in rows:
        for _, row in pdf.iterrows():
            ts = row['timestamp']
            if pd.isna(ts):
                continue
            if latest_timestamp is None or ts &amp;gt; latest_timestamp:
                latest_timestamp = ts

    if latest_timestamp is None:
        yield pd.DataFrame()
        return

    # Check if we have existing state
    existing_timestamp = None
    if self.last_seen.exists():
        existing_state = self.last_seen.get()
        existing_timestamp = existing_state[0]

    # Update state only if new heartbeat is more recent
    if existing_timestamp is None or latest_timestamp &amp;gt; existing_timestamp:
        # Cancel existing timers (device is back online)
        for timer in self.handle.listTimers():
            self.handle.deleteTimer(timer)

        # Update state with new timestamp
        self.last_seen.update((latest_timestamp,))

        # Register a timer for heartbeat deadline detection,
        # 30 seconds from the current processing time
        current_time_ms = timerValues.getCurrentProcessingTimeInMs()
        deadline_ms = current_time_ms + HEARTBEAT_INTERVAL_MS
        self.handle.registerTimer(deadline_ms)

    yield pd.DataFrame()  # No output from input handling&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The &lt;code&gt;handleInputRows()&lt;/code&gt; method processes incoming heartbeat events for each device by extracting the latest timestamp, updating the &lt;code&gt;last_seen&lt;/code&gt; state, and managing timers. When a newer heartbeat arrives, it cancels any existing timers and registers a new 30-second expiry timer to detect future inactivity. Because alerts are only emitted upon timer expiration, the method yields an empty DataFrame during normal heartbeat processing.&lt;/p&gt; 
&lt;h4&gt;2.3 Handle timer expiration and emit alerts&lt;/h4&gt; 
&lt;p&gt;The &lt;code&gt;handleExpiredTimer()&lt;/code&gt; method is called when a registered timer fires. This is where we detect offline devices and emit alerts.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;def handleExpiredTimer(
    self, key: tuple, timerValues, expiredTimerInfo
) -&amp;gt; Iterator[pd.DataFrame]:
    device_id = key[0]
    current_time_ms = timerValues.getCurrentProcessingTimeInMs()

    # Verify state exists
    if not self.last_seen.exists():
        yield pd.DataFrame()
        return

    # Get last seen timestamp from state
    last_seen_state = self.last_seen.get()
    last_seen_timestamp = last_seen_state[0]

    if last_seen_timestamp is None or pd.isna(last_seen_timestamp):
        yield pd.DataFrame()
        return

    # Calculate how long device has been offline
    last_seen_ms = int(last_seen_timestamp.timestamp() * 1000)
    offline_duration_ms = current_time_ms - last_seen_ms
    offline_duration_seconds = offline_duration_ms / 1000.0

    # Create alert as a Pandas DataFrame
    alert_df = pd.DataFrame({
        "device_id": [device_id],
        "alert_type": ["DEVICE_OFFLINE"],
        "last_seen": [last_seen_timestamp],
        "offline_duration_seconds": [offline_duration_seconds],
        "alert_timestamp": [datetime.fromtimestamp(current_time_ms / 1000.0)]
    })

    # Register another timer for repeat alerts (every 60 seconds)
    next_alert_time = current_time_ms + ALERT_REPEAT_INTERVAL_MS
    self.handle.registerTimer(next_alert_time)

    yield alert_df  # Emit the alert&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The &lt;code&gt;handleExpiredTimer()&lt;/code&gt; method is triggered automatically when a device’s inactivity timer expires, retrieving the &lt;code&gt;last_seen&lt;/code&gt; state to calculate the offline duration and yielding an alert dataframe to the output stream. It also registers a follow-up timer for repeat alerts every 60 seconds, which continues until a new heartbeat arrives and cancels the timer via &lt;code&gt;handleInputRows()&lt;/code&gt;.&lt;/p&gt; 
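&lt;p&gt;To reason about how the two handlers interact without starting a streaming job, it can help to model one device in plain Python. The following is a simplified sketch, not the Spark API: the &lt;code&gt;DeviceModel&lt;/code&gt; class is hypothetical, the constant values are taken from the intervals described in this post, and timers fire exactly at their deadlines rather than at the next micro-batch.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;from bisect import bisect_right, insort

HEARTBEAT_INTERVAL_MS = 30_000      # deadline after the last accepted heartbeat
ALERT_REPEAT_INTERVAL_MS = 60_000   # spacing of repeat alerts while offline


class DeviceModel:
    """In-memory stand-in for one device's last_seen state and timers."""

    def __init__(self):
        self.last_seen_ms = None
        self.timers = []  # sorted list of pending deadlines in milliseconds

    def on_heartbeat(self, event_ms, now_ms):
        """Mirror handleInputRows: accept only heartbeats newer than the stored one."""
        if self.last_seen_ms is None:
            newest = event_ms
        else:
            newest = max(event_ms, self.last_seen_ms)
        if newest == self.last_seen_ms:
            return  # stale or duplicate heartbeat: keep the existing timers
        self.last_seen_ms = newest
        # Cancel pending timers and register a fresh deadline
        self.timers = [now_ms + HEARTBEAT_INTERVAL_MS]

    def advance_to(self, now_ms):
        """Mirror handleExpiredTimer: fire each due timer and schedule a repeat."""
        alerts = []
        # bisect_right counts the deadlines at or before now_ms
        while bisect_right(self.timers, now_ms):
            deadline = self.timers.pop(0)
            alerts.append({
                "alert_type": "DEVICE_OFFLINE",
                "offline_duration_seconds": (deadline - self.last_seen_ms) / 1000.0,
            })
            insort(self.timers, deadline + ALERT_REPEAT_INTERVAL_MS)
        return alerts


if __name__ == "__main__":
    device = DeviceModel()
    device.on_heartbeat(event_ms=20_000, now_ms=20_000)
    device.on_heartbeat(event_ms=10_000, now_ms=25_000)  # out of order, ignored
    for alert in device.advance_to(120_000):
        print(alert)&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;In this run the late heartbeat does not reset the deadline, so the model emits one alert 30 seconds after the last accepted heartbeat and a repeat 60 seconds later.&lt;/p&gt; 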
&lt;p&gt;There are several ways you could extend this solution for production use. You could implement exponential backoff for repeat alerts to reduce noise, for example, alerting after 60 seconds, then 2 minutes, then 5 minutes, and so on. Other improvements could include adding severity escalation based on offline duration, integrating with notification services like Amazon SNS for downstream alerting, or setting a maximum retry limit to stop alerts for permanently decommissioned devices.&lt;/p&gt; 
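&lt;p&gt;As an illustration of the backoff idea, the following sketch computes a repeat interval that doubles after each alert and is capped at a maximum. It is one possible schedule, not the one used in the sample script: the doubling rule, the 10-minute cap, and the function name are assumptions, and the count of alerts already sent would need to be kept in an additional state variable.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;BASE_INTERVAL_MS = 60_000    # first repeat after 60 seconds
MAX_INTERVAL_MS = 600_000    # never wait longer than 10 minutes


def next_repeat_interval_ms(alerts_sent):
    """Double the wait for each alert already sent, capped at MAX_INTERVAL_MS."""
    return min(BASE_INTERVAL_MS * 2 ** alerts_sent, MAX_INTERVAL_MS)


if __name__ == "__main__":
    print([next_repeat_interval_ms(n) for n in range(5)])
    # [60000, 120000, 240000, 480000, 600000]&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Inside &lt;code&gt;handleExpiredTimer()&lt;/code&gt;, a value like this could replace the fixed &lt;code&gt;ALERT_REPEAT_INTERVAL_MS&lt;/code&gt; when registering the follow-up timer.&lt;/p&gt; 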
&lt;h4&gt;2.4 Apply transformWithState to the streaming DataFrame&lt;/h4&gt; 
&lt;p&gt;Now we connect everything together by applying our &lt;code&gt;HeartbeatMonitor&lt;/code&gt; processor to the streaming data.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;# Read and parse heartbeat events from Kinesis
parsed_df = kinesis_df \
    .selectExpr("CAST(data AS STRING) as json_data") \
    .select(from_json(col("json_data"), heartbeat_schema).alias("heartbeat")) \
    .select(
        col("heartbeat.device_id"),
        to_timestamp(col("heartbeat.timestamp")).alias("timestamp"),
        col("heartbeat.battery_level"),
        col("heartbeat.signal_strength"),
        col("heartbeat.firmware_version")
    )

# Apply transformWithState for stateful processing
alerts_df = parsed_df \
    .groupBy("device_id") \
    .transformWithStateInPandas(
        statefulProcessor=HeartbeatMonitor(),
        outputStructType=alert_output_schema,
        outputMode="append",
        timeMode="processingTime"
    )

# Send alerts to SNS (defined before the query that references it)
def send_to_sns(batch_df, batch_id):
    rows = batch_df.collect()  # collect once instead of count() plus collect()
    if not rows:
        return
    sns_client = boto3.client('sns', region_name=KINESIS_REGION)
    for row in rows:
        message = {
            "device_id": row["device_id"],
            "alert_type": row["alert_type"],
            "last_seen": str(row["last_seen"]),
            "offline_duration_seconds": row["offline_duration_seconds"],
            "alert_timestamp": str(row["alert_timestamp"])
        }
        sns_client.publish(
            TopicArn=SNS_TOPIC_ARN,
            Message=json.dumps(message),
            Subject=f"Device Offline Alert: {row['device_id']}"
        )

# Write alerts to SNS
query = alerts_df.writeStream \
    .outputMode("append") \
    .foreachBatch(send_to_sns) \
    .option("checkpointLocation", CHECKPOINT_LOCATION) \
    .trigger(processingTime="10 seconds") \
    .start()&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The streaming pipeline parses JSON heartbeat events from Kinesis, partitions them by &lt;code&gt;device_id&lt;/code&gt;, and applies the &lt;code&gt;HeartbeatMonitor&lt;/code&gt; stateful processor using &lt;code&gt;transformWithStateInPandas()&lt;/code&gt; with processing-time timers and append output mode. The resulting alert stream is written to SNS via &lt;code&gt;foreachBatch()&lt;/code&gt; with checkpointing enabled for fault tolerance and micro-batches triggered every 10 seconds.&lt;/p&gt; 
&lt;p&gt;To summarize, implementing the heartbeat monitor requires just three methods. The &lt;code&gt;init()&lt;/code&gt; method sets up your state variables, &lt;code&gt;handleInputRows()&lt;/code&gt; processes incoming heartbeats and manages timers, and &lt;code&gt;handleExpiredTimer()&lt;/code&gt; generates offline alerts. The &lt;code&gt;transformWithState&lt;/code&gt; API handles the underlying complexity of state management, checkpointing, and timer scheduling automatically.&lt;/p&gt; 
&lt;h3&gt;Step 3: Create IAM role for job execution&lt;/h3&gt; 
&lt;p&gt;Create an IAM role that allows EMR Serverless to assume it for running your Spark job. For detailed instructions on creating an IAM role, see &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_create_for-service.html" target="_blank" rel="noopener"&gt;Creating an IAM role&lt;/a&gt;. Use the following trust policy for the role.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-json"&gt;{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": {
      "Service": "emr-serverless.amazonaws.com"
    },
    "Action": "sts:AssumeRole"
  }]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Attach a permissions policy that grants the role access to read from the Kinesis stream, write to the S3 bucket for checkpoints and application artifacts, and publish alerts to the SNS topic:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-json"&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "KinesisAccess",
      "Effect": "Allow",
      "Action": [
        "kinesis:GetRecords",
        "kinesis:GetShardIterator",
        "kinesis:DescribeStream",
        "kinesis:DescribeStreamSummary",
        "kinesis:ListShards",
        "kinesis:SubscribeToShard"
      ],
      "Resource": "arn:aws:kinesis:us-east-1:*:stream/iot-heartbeats"
    },
    {
      "Sid": "SNSPublish",
      "Effect": "Allow",
      "Action": "sns:Publish",
      "Resource": "arn:aws:sns:us-east-1:*:iot-alerts"
    },
    {
      "Sid": "S3Access",
      "Effect": "Allow",
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::your-bucket",
        "arn:aws:s3:::your-bucket/*"
      ]
    }
  ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
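&lt;p&gt;The policy above uses a wildcard (&lt;code&gt;*&lt;/code&gt;) in the account ID position of the Kinesis and SNS ARNs. If you prefer to scope the policy to a single account, one option is to generate the document from a small helper. The following is a minimal sketch: the function name &lt;code&gt;build_permissions_policy&lt;/code&gt; is hypothetical, and &lt;code&gt;111122223333&lt;/code&gt; is a placeholder account ID that you would replace with your own.&lt;/p&gt; 

```python
import json


def build_permissions_policy(region, account_id, stream_name, topic_name, bucket):
    """Build the job role's permissions policy with account-scoped ARNs."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "KinesisAccess",
                "Effect": "Allow",
                "Action": [
                    "kinesis:GetRecords",
                    "kinesis:GetShardIterator",
                    "kinesis:DescribeStream",
                    "kinesis:DescribeStreamSummary",
                    "kinesis:ListShards",
                    "kinesis:SubscribeToShard",
                ],
                "Resource": f"arn:aws:kinesis:{region}:{account_id}:stream/{stream_name}",
            },
            {
                "Sid": "SNSPublish",
                "Effect": "Allow",
                "Action": "sns:Publish",
                "Resource": f"arn:aws:sns:{region}:{account_id}:{topic_name}",
            },
            {
                "Sid": "S3Access",
                "Effect": "Allow",
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:ListBucket",
                ],
                # Bucket ARN for ListBucket, object ARN for the object-level actions
                "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
            },
        ],
    }


# Placeholder values; replace the account ID and bucket with your own.
policy = build_permissions_policy(
    "us-east-1", "111122223333", "iot-heartbeats", "iot-alerts", "your-bucket"
)
print(json.dumps(policy, indent=2))
```

&lt;p&gt;The printed JSON can be saved to a file and attached to the role in the same way as the policy shown earlier.&lt;/p&gt; 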
&lt;h3&gt;Step 4: Upload external dependencies required for executing the streaming job&lt;/h3&gt; 
&lt;p&gt;In this step, you will download the required external dependencies and upload them to your S3 bucket to make them available for your EMR Serverless streaming job.&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Download the Spark Kinesis connector JAR (&lt;a href="https://aws-blogs-artifacts-public.s3.us-east-1.amazonaws.com/artifacts/BDB-5849/spark-streaming-sql-kinesis-connector_2.13-2.0.0-SNAPSHOT.jar" target="_blank" rel="noopener"&gt;download link&lt;/a&gt;) and copy it to your S3 bucket as &lt;code&gt;s3://your-bucket/jars/spark-kinesis-connector.jar&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;Download the Protobuf dependency (&lt;a href="https://aws-blogs-artifacts-public.s3.us-east-1.amazonaws.com/artifacts/BDB-5849/protobuf_pkg.tar" target="_blank" rel="noopener"&gt;download link&lt;/a&gt;) and copy it to your S3 bucket as &lt;code&gt;s3://your-bucket/pyfiles/protobuf_pkg.tar.gz&lt;/code&gt;.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Step 5: Submit the streaming job&lt;/h3&gt; 
&lt;p&gt;Now that the application, IAM role, and dependencies are in place, you can submit the streaming job. This step configures the Spark job parameters and submits the job to your EMR Serverless application in streaming mode. For more details on submitting jobs, see &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/jobs.html" target="_blank" rel="noopener"&gt;Starting a job run&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;First, create a file named &lt;code&gt;job-driver.json&lt;/code&gt; with the following content. Replace the S3 paths with the locations where you uploaded your script and dependencies in the previous steps.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-json"&gt;{
  "sparkSubmit": {
    "entryPoint": "s3://your-bucket/scripts/heartbeat_monitor.py",
    "sparkSubmitParameters": "--jars s3://your-bucket/jars/spark-kinesis-connector.jar --archives s3://your-bucket/pyfiles/protobuf_pkg.tar.gz#protobuf_pkg --conf spark.executor.cores=4 --conf spark.executor.memory=16g --conf spark.driver.cores=4 --conf spark.driver.memory=16g --conf spark.executor.instances=3 --conf spark.sql.streaming.stateStore.providerClass=org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider --conf spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled=true --conf spark.emr-serverless.driverEnv.PYTHONPATH=./protobuf_pkg --conf spark.executorEnv.PYTHONPATH=./protobuf_pkg"
  }
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
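&lt;p&gt;Because &lt;code&gt;sparkSubmitParameters&lt;/code&gt; is a single long string, it can be easier to maintain when it is generated from a dictionary of settings. The following sketch rebuilds the same job driver document in Python. The helper name &lt;code&gt;build_job_driver&lt;/code&gt; is hypothetical, and the bucket path is the same placeholder used above.&lt;/p&gt; 

```python
import json

BUCKET = "s3://your-bucket"  # placeholder; replace with your bucket

# Spark settings from the job driver above, one entry per --conf flag
spark_conf = {
    "spark.executor.cores": "4",
    "spark.executor.memory": "16g",
    "spark.driver.cores": "4",
    "spark.driver.memory": "16g",
    "spark.executor.instances": "3",
    "spark.sql.streaming.stateStore.providerClass":
        "org.apache.spark.sql.execution.streaming.state.RocksDBStateStoreProvider",
    "spark.sql.streaming.stateStore.rocksdb.changelogCheckpointing.enabled": "true",
    "spark.emr-serverless.driverEnv.PYTHONPATH": "./protobuf_pkg",
    "spark.executorEnv.PYTHONPATH": "./protobuf_pkg",
}


def build_job_driver(bucket, conf):
    """Assemble the sparkSubmit job driver document from its parts."""
    params = [
        f"--jars {bucket}/jars/spark-kinesis-connector.jar",
        f"--archives {bucket}/pyfiles/protobuf_pkg.tar.gz#protobuf_pkg",
    ]
    params += [f"--conf {key}={value}" for key, value in conf.items()]
    return {
        "sparkSubmit": {
            "entryPoint": f"{bucket}/scripts/heartbeat_monitor.py",
            "sparkSubmitParameters": " ".join(params),
        }
    }


job_driver = build_job_driver(BUCKET, spark_conf)
print(json.dumps(job_driver, indent=2))
```

&lt;p&gt;Redirecting the output to &lt;code&gt;job-driver.json&lt;/code&gt; produces a file equivalent to the one shown above, with each Spark setting on its own line in the source dictionary.&lt;/p&gt; 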
&lt;p&gt;Then, run the following command to submit the job. Replace the application ID and account ID with your own values.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws emr-serverless start-job-run \
  --application-id &amp;lt;YOUR_APPLICATION_ID&amp;gt; \
  --execution-role-arn arn:aws:iam::&amp;lt;ACCOUNT_ID&amp;gt;:role/EMRServerlessJobRole \
  --job-driver file://job-driver.json \
  --mode STREAMING \
  --retry-policy maxFailedAttemptsPerHour=1 \
  --region us-east-1&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Running &lt;code&gt;transformWithState&lt;/code&gt; on Amazon EMR Serverless provides several operational advantages over self-managed Spark clusters. In streaming mode, the Spark driver remains alive between micro-batches, eliminating the overhead of repeatedly starting and stopping the application. You don’t need to provision or manage executors because EMR Serverless automatically scales compute resources up and down based on workload demands, so you only pay for what you use. Your IoT heartbeat monitor can handle traffic spikes, such as thousands of devices reconnecting simultaneously after a network outage, without manual intervention. EMR Serverless also provides built-in job resiliency, real-time monitoring, and enhanced log management, reducing the operational burden of running streaming applications in production.&lt;/p&gt; 
&lt;h2&gt;Testing the solution&lt;/h2&gt; 
&lt;p&gt;Now that our streaming application is deployed, let’s test it by sending heartbeat events and observing the offline detection behavior.&lt;/p&gt; 
&lt;h3&gt;Step 1: Open AWS CloudShell&lt;/h3&gt; 
&lt;p&gt;Open &lt;a href="https://docs.aws.amazon.com/cloudshell/latest/userguide/welcome.html" target="_blank" rel="noopener"&gt;AWS CloudShell&lt;/a&gt; in your AWS account from the AWS Management Console.&lt;/p&gt; 
&lt;h3&gt;Step 2: Send heartbeat events using CLI&lt;/h3&gt; 
&lt;p&gt;Run the following bash script to send heartbeat events for three devices every 10 seconds.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;#!/bin/bash

# Send one heartbeat per device every 10 seconds until interrupted (Ctrl+C)
while true; do
  for device_id in device-001 device-002 device-003; do
    payload="{\"device_id\":\"${device_id}\",\"timestamp\":\"$(date -u +%Y-%m-%dT%H:%M:%SZ)\",\"battery_level\":87.5,\"signal_strength\":-42.3,\"firmware_version\":\"v2.1.0\"}"

    # AWS CLI v2 expects base64-encoded blobs; -w 0 disables GNU base64 line
    # wrapping so the encoded payload stays a single argument
    aws kinesis put-record \
      --stream-name iot-heartbeats \
      --partition-key "${device_id}" \
      --data "$(echo "${payload}" | base64 -w 0)" \
      --region us-east-1
  done

  sleep 10
done&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The script stamps each event with the current UTC time and keeps sending events until you stop it with Ctrl+C.&lt;/p&gt; 
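&lt;p&gt;If you prefer Python to the AWS CLI, the same events can be sent with Boto3. The following is a minimal sketch that assumes Boto3 is installed and credentials are configured. The helper names &lt;code&gt;build_heartbeat&lt;/code&gt; and &lt;code&gt;send_heartbeats&lt;/code&gt; are hypothetical, and the stream name and Region match the values used in this post.&lt;/p&gt; 

```python
import json
import time
from datetime import datetime, timezone

STREAM_NAME = "iot-heartbeats"
REGION = "us-east-1"
DEVICE_IDS = ["device-001", "device-002", "device-003"]


def build_heartbeat(device_id, now=None):
    """Return one heartbeat event as a dict, stamped with the current UTC time."""
    now = now or datetime.now(timezone.utc)
    return {
        "device_id": device_id,
        "timestamp": now.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "battery_level": 87.5,
        "signal_strength": -42.3,
        "firmware_version": "v2.1.0",
    }


def send_heartbeats(device_ids, interval_seconds=10, rounds=None):
    """Send one heartbeat per device every interval_seconds.

    rounds=None loops until interrupted; pass a number to stop after that many rounds.
    """
    import boto3  # imported here so build_heartbeat works without AWS access

    kinesis = boto3.client("kinesis", region_name=REGION)
    sent_rounds = 0
    while rounds is None or sent_rounds != rounds:
        for device_id in device_ids:
            kinesis.put_record(
                StreamName=STREAM_NAME,
                PartitionKey=device_id,
                # Boto3 accepts raw bytes, so no manual base64 step is needed
                Data=json.dumps(build_heartbeat(device_id)).encode("utf-8"),
            )
        sent_rounds += 1
        time.sleep(interval_seconds)


# Preview one event locally; call send_heartbeats(DEVICE_IDS) to start sending.
print(json.dumps(build_heartbeat("device-001")))
```

&lt;p&gt;Passing a shorter list, for example &lt;code&gt;send_heartbeats(["device-001", "device-003"])&lt;/code&gt;, leaves &lt;code&gt;device-002&lt;/code&gt; silent, which is one way to exercise the offline detection path for a single device.&lt;/p&gt; 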
&lt;h3&gt;Step 3: Observe normal operation&lt;/h3&gt; 
&lt;p&gt;As you send heartbeat events every 10 seconds, the Spark application receives each event and updates the device’s state. A timer is then registered for 30 seconds in the future. Each new heartbeat cancels the existing timer and registers a new one, effectively resetting the countdown. As long as heartbeats continue to arrive within the 30-second window, no alerts are sent.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/12/BDB-5849-2.png" alt="Timeline diagram showing normal device operation over 60 seconds with heartbeats arriving every 10 seconds, each resetting the 30-second timer" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;The above timeline diagram shows a 60-second window of normal device operation. Heartbeat events arrive every 10 seconds (at 0s, 10s, 20s, 30s, 40s, 50s, and 60s), each resetting the 30-second timer window. Because every heartbeat arrives well within the 30-second threshold, the timer never expires, the device state remains online, and no alerts are triggered.&lt;/p&gt; 
&lt;h3&gt;Step 4: Test offline detection&lt;/h3&gt; 
&lt;p&gt;Stop sending heartbeat events for the device and wait 30 seconds. You should receive an SNS alert indicating the device is offline.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/12/BDB-5849-3.png" alt="Timeline diagram showing offline detection over 110 seconds with the 30-second timer expiring and triggering SNS alerts" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;The timeline diagram shows offline detection over 110 seconds. The device sends heartbeats at 0s, 10s, and 20s before going offline. The 30-second timer expires at 50s, triggering Alert #1 via SNS, followed by repeat Alert #2 at 110s after the 60-second repeat timer.&lt;/p&gt; 
&lt;p&gt;If you continue to not send heartbeats, additional alerts will be sent every 60 seconds.&lt;/p&gt; 
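&lt;p&gt;The alert times in these timelines follow directly from the two timer rules: a heartbeat registers a timer 30 seconds ahead and cancels any pending one, and an expired timer registers a follow-up 60 seconds later. The following plain-Python sketch reproduces that arithmetic. It is a simplified model with a hypothetical function name, it uses whole seconds, and it ignores the micro-batch trigger interval, so real alerts may arrive slightly later than the computed times.&lt;/p&gt; 

```python
def simulate_alerts(heartbeat_times, end_time, timeout=30, repeat_interval=60):
    """Return the times (in whole seconds) at which offline alerts would fire.

    Each heartbeat cancels any pending timer and registers a new one `timeout`
    seconds later. Each expired timer raises an alert and registers a
    follow-up `repeat_interval` seconds later.
    """
    alerts = []
    heartbeats = sorted(heartbeat_times)
    for index, beat in enumerate(heartbeats):
        if index + 1 == len(heartbeats):
            stop = end_time + 1  # include an alert landing exactly on end_time
        else:
            stop = heartbeats[index + 1]  # the next heartbeat cancels the timer
        alerts.extend(range(beat + timeout, stop, repeat_interval))
    return alerts


# Normal operation: a heartbeat every 10 seconds, so no timer ever expires.
print(simulate_alerts([0, 10, 20, 30, 40, 50, 60], end_time=60))  # []

# Offline detection: the last heartbeat is at 20s, observed until 110s.
print(simulate_alerts([0, 10, 20], end_time=110))  # [50, 110]
```

&lt;p&gt;The second call matches the offline detection timeline: the first alert fires 30 seconds after the last heartbeat, and the repeat alert follows 60 seconds after that.&lt;/p&gt; 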
&lt;h3&gt;Step 5: Test device recovery&lt;/h3&gt; 
&lt;p&gt;Resume sending heartbeat events using the same script. The application cancels all existing timers for the device and stops sending SNS alerts.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/12/BDB-5849-4.png" alt="Timeline diagram showing device recovery lifecycle with timers canceled and device returning to online state" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;The timeline diagram shows the complete device recovery lifecycle over 140 seconds across three phases: normal operation with heartbeats, offline detection with SNS alerts, and recovery, where timers are canceled and the device returns to the online state.&lt;/p&gt; 
&lt;h2&gt;Clean up&lt;/h2&gt; 
&lt;p&gt;To avoid incurring ongoing charges, follow these steps to clean up the resources.&lt;/p&gt; 
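&lt;p&gt;&lt;strong&gt;Step 1: Stop the streaming job and the EMR Serverless application&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Stop your running streaming job. After the job run has been cancelled, stop the application, because an application must be stopped before it can be deleted:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws emr-serverless stop-job-run \
  --application-id &amp;lt;your-application-id&amp;gt; \
  --job-run-id &amp;lt;your-job-run-id&amp;gt;

aws emr-serverless stop-application \
  --application-id &amp;lt;your-application-id&amp;gt;&lt;/code&gt;&lt;/pre&gt; 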
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Step 2: Delete the EMR Serverless application&lt;/strong&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws emr-serverless delete-application \
  --application-id &amp;lt;your-application-id&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Step 3: Delete the Kinesis data stream&lt;/strong&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws kinesis delete-stream --stream-name iot-heartbeats&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Step 4: Remove S3 objects&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Delete the checkpoint data, scripts, and dependencies from your S3 bucket:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws s3 rm s3://your-bucket/checkpoints/ --recursive
aws s3 rm s3://your-bucket/scripts/ --recursive
aws s3 rm s3://your-bucket/jars/ --recursive
aws s3 rm s3://your-bucket/pyfiles/ --recursive&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h2&gt;Real-world use cases for stateful streaming&lt;/h2&gt; 
&lt;p&gt;The &lt;code&gt;transformWithState&lt;/code&gt; API enables developers to build sophisticated streaming applications that were previously difficult to implement. Here are a few examples of how it can be applied across industries.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Telecommunications and network monitoring:&lt;/strong&gt; Telecom providers need to detect network anomalies and SLA violations as they happen across millions of concurrent sessions. With &lt;code&gt;transformWithState&lt;/code&gt;, developers can maintain per-session state to track call detail records, compare real-time network metrics against established baselines, and trigger alerts the moment thresholds are breached. Automatic state TTL ensures that completed session records are cleaned up without manual intervention.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Financial services and fraud detection:&lt;/strong&gt; Detecting fraud requires correlating multiple signals across a sequence of transactions in real time. With &lt;code&gt;transformWithState&lt;/code&gt;, developers can maintain per-account state that tracks transaction histories, flags suspicious patterns like rapid purchases across geographies, and calculates rolling risk scores. Multiple state variables per key allow tracking different dimensions of activity, such as transaction velocity, location changes, and spending deviations, within a single stateful operator.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;E-commerce and customer engagement:&lt;/strong&gt; Understanding customer behavior in real time is critical for driving conversions. Using &lt;code&gt;transformWithState&lt;/code&gt;, developers can build session-aware applications that track browsing and cart activity with timer-based state expiration, detecting cart abandonment after a configurable timeout and triggering personalized re-engagement notifications. The State Data Source Reader enables teams to inspect session state mid-stream, making it easier to debug and validate real-time customer journey logic.&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;Apache Spark 4.0’s &lt;code&gt;transformWithState&lt;/code&gt; API represents a significant advancement in stateful stream processing, making it simpler to build complex real-time applications like IoT device monitoring. Combined with Amazon EMR Serverless, you get a fully managed platform that scales automatically and eliminates infrastructure management overhead.&lt;/p&gt; 
&lt;p&gt;This post demonstrates how to use the native timer support capability of &lt;code&gt;transformWithState&lt;/code&gt; to build a real-time IoT device monitoring application. We encourage you to explore other capabilities such as Automatic State TTL, Schema Evolution, and Multiple State Variables on Amazon EMR Serverless to build more sophisticated streaming applications tailored to your needs.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/12/BDB-5849-5.jpg" alt="Raj Ramasubbu" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Raj Ramasubbu&lt;/h3&gt; 
  &lt;p&gt;Raj Ramasubbu is a Senior Specialist Solutions Architect for Analytics and AI at AWS. He partners with ISV customers to design and implement modern data platforms that balance performance, cost efficiency, and operational resilience at scale. With over two decades of experience spanning data engineering, advanced analytics, and machine learning across industries such as healthcare, financial services, and retail, Raj brings a practitioner’s perspective to solving complex data challenges in the cloud.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/12/BDB-5849-6.jpg" alt="Rekha Veeraraghavan" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Rekha Veeraraghavan&lt;/h3&gt; 
  &lt;p&gt;Rekha Veeraraghavan is a Technical Account Manager at Amazon Web Services (AWS). She serves as a Subject Matter Expert in AWS Analytics services, specializing in AWS Glue and Amazon Athena. Rekha provides expert guidance and technical support to enterprise and strategic customers, helping them optimize data analytics solutions. With deep expertise in data engineering, she enables organizations to build scalable, efficient, and cost-effective data processing pipelines on AWS.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/12/BDB-5849-7.jpg" alt="Praveen Krishnamoorthy Ravikumar" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Praveen Krishnamoorthy Ravikumar&lt;/h3&gt; 
  &lt;p&gt;Praveen Krishnamoorthy Ravikumar is an Analytics Specialist Solutions Architect at AWS. He helps customers design and implement modern data and analytics platforms that leverage the scalability, flexibility, and innovation of the cloud. He is passionate about solving complex data challenges and enabling organizations to unlock actionable insights from their data.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>Announcing general availability of Apache Spark 4.0 on Amazon EMR</title>
		<link>https://aws.amazon.com/blogs/big-data/announcing-general-availability-of-apache-spark-4-0-on-amazon-emr/</link>
		
		<dc:creator><![CDATA[Suthan Phillips]]></dc:creator>
		<pubDate>Tue, 09 Jun 2026 16:13:21 +0000</pubDate>
				<category><![CDATA[Amazon EMR]]></category>
		<category><![CDATA[Announcements]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<guid isPermaLink="false">25b5518f79147fc97b5fc52edcf7f2569f6ee391</guid>

					<description>With this general availability announcement, Spark 4.0 is now supported across Amazon EMR Serverless, Amazon EMR on EC2, and Amazon EMR on EKS deployment options. In this post, you’ll learn about key Spark 4.0 capabilities now available on Amazon EMR including Spark Connect, the Variant data type, SQL scripting, Python API improvements, and streaming enhancements, along with infrastructure changes in the new emr-spark-8.0 release.</description>
										<content:encoded>&lt;p&gt;As data volumes grow and pipelines become more complex, you need an engine that handles semi-structured data natively, supports streaming state without operational overhead, and allows you to develop interactively against production-scale compute. Spark 4.0 addresses these three challenges that slow modern data teams: wrangling semi-structured data, managing streaming state, and bridging the gap between interactive development and production-scale execution. With VARIANT data type, state-management improvements, and Spark Connect availability in Spark 4.0, you can now handle these workloads with less code, fewer operational trade-offs, and faster iteration cycles, all on Amazon EMR optimized runtime, which runs Spark workloads up to 4.5× faster than open-source &lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html" target="_blank" rel="noopener"&gt;Apache Spark&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;With this general availability announcement, Spark 4.0 is now supported across &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html" target="_blank" rel="noopener"&gt;Amazon EMR Serverless&lt;/a&gt;, Amazon &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-overview.html" target="_blank" rel="noopener"&gt;EMR on EC2&lt;/a&gt;, and &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/emr-eks.html" target="_blank" rel="noopener"&gt;Amazon EMR on EKS&lt;/a&gt; deployment options. In this post, you’ll learn about key Spark 4.0 capabilities now available on Amazon EMR including Spark Connect, the Variant data type, SQL scripting, Python API improvements, and streaming enhancements, along with infrastructure changes in the new emr-spark-8.0 release.&lt;/p&gt; 
&lt;h2&gt;New features in GA&lt;/h2&gt; 
&lt;p&gt;Apache Spark 4.0 introduces several capabilities that are now generally available on Amazon EMR.&lt;/p&gt; 
&lt;h2&gt;Spark Connect&lt;/h2&gt; 
&lt;p&gt;Most Spark development is iterative and disconnected from production. You write code locally, test it against a sample, then package and deploy it to a cluster. It often fails due to data issues at scale, environment mismatches, or dependency conflicts. The feedback loop is slow, and the gap between development and production is where most time is lost.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/spark-connect.html" target="_blank" rel="noopener"&gt;Spark Connect&lt;/a&gt; closes that gap by introducing a decoupled client-server architecture that changes how your application communicates with Spark. In previous versions, your application code and the Spark driver ran inside the same JVM process, meaning issues in your application code could destabilize the Spark driver and disrupt the entire session. With Spark Connect, your application runs as a lightweight client that submits logical plans to a Spark server over gRPC, and the server handles execution independently. The client doesn’t require a local Spark installation or a JVM, and it doesn’t need to run on a cluster node. It only needs connectivity to the server endpoint.&lt;/p&gt; 
&lt;p&gt;With Amazon EMR, this means you can write PySpark from your preferred IDE (VS Code, PyCharm), Jupyter notebooks, Amazon SageMaker Unified Studio Data Notebooks, Amazon Q Developer, or Kiro, and Spark Connect routes your DataFrame transformations and SQL queries to Amazon EMR for execution over a secure connection. You can set breakpoints, inspect variables, and step through transformations while your data is processed on serverless compute, catching issues during development instead of after deployment. There are no clusters to provision, no code to repackage, and no infrastructure to manage.&lt;/p&gt; 
&lt;p&gt;This architecture also improves session resilience. A client-side failure doesn’t affect the Spark server, so other workloads continue to run without disruption. Spark Connect is an open Apache Spark standard. The same PySpark code works across different Spark backends by changing the connection endpoint.&lt;/p&gt; 
&lt;p&gt;For example, connecting to Amazon EMR Serverless from your local IDE takes only a few lines of PySpark code:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .remote("sc://&amp;lt;endpoint&amp;gt;:443/;use_ssl=true;x-aws-proxy-auth=") \
    .getOrCreate()
df = spark.sql("SELECT * FROM my_catalog.my_database.my_table")
df.groupBy("category").count().show()&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;On Amazon EMR Serverless, start a session to retrieve your endpoint and auth token, then connect remotely using the standard &lt;code&gt;sc://&lt;/code&gt; protocol. Every Spark operation executes on Amazon EMR Serverless while your code stays local.&lt;/p&gt; 
&lt;p&gt;The following video showcases Spark Connect and Variant features together.&lt;/p&gt; 
&lt;div style="width: 640px;" class="wp-video"&gt;
 &lt;video class="wp-video-shortcode" id="video-91417-3" width="640" height="360" preload="metadata" controls="controls"&gt;
  &lt;source type="video/mp4" src="https://d2908q01vomqb2.cloudfront.net/artifacts/DBSBlogs/BDB-5882/emrsparkconnectvariantdemo.mp4?_=3"&gt;
 &lt;/video&gt;
&lt;/div&gt; 
&lt;p&gt;For a step-by-step getting-started walkthrough, visit &lt;a href="https://aws.amazon.com/blogs/big-data/announcing-spark-connect-on-amazon-emr-serverless-interactive-pyspark-development-anywhere/" target="_blank" rel="noopener"&gt;Announcing Spark Connect on Amazon EMR Serverless: Interactive PySpark development, anywhere&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;Data type and table format enhancements&lt;/h2&gt; 
&lt;p&gt;This section covers the VARIANT data type and Apache Iceberg V3 support. These two additions improve how you store and query semi-structured data.&lt;/p&gt; 
&lt;h3&gt;Apache Iceberg V3 support&lt;/h3&gt; 
&lt;p&gt;Amazon EMR has supported Apache Iceberg V3 since Amazon EMR release 7.x, introducing capabilities such as deletion vectors and row lineage. With Spark 4.0 on Amazon EMR, that support deepens, unlocking capabilities that had a hard dependency on Spark 4.0 itself, including VARIANT column storage and unknown type handling. For teams running data lakehouse workloads, the table format underneath your data determines how efficiently it is stored, how reliably it evolves, and how safely multiple tools can read and write it simultaneously.&lt;/p&gt; 
&lt;p&gt;What this means for your workloads:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;VARIANT and Iceberg working together:&lt;/strong&gt; VARIANT columns can now be stored natively in Iceberg V3 tables, combining efficient semi-structured data storage with Iceberg’s schema evolution and time travel capabilities. This eliminates the pipeline complexity of upfront schema definitions.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;More efficient partitioning:&lt;/strong&gt; Multi-argument transforms accept multiple input columns in a single partition expression, such as range (order_date, product_category), giving you finer control over data layout. They produce a single composite key instead of separate columns whose Cartesian product can inflate the partition count. The result is less data scanned, faster queries, and lower compute costs for high-volume workloads.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Safer schema evolution:&lt;/strong&gt; Unknown type handling ensures that older readers do not break when newer writers introduce new column types, reducing coordination overhead across teams and tools during upgrades.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Fine-grained access control (FGAC):&lt;/strong&gt; Column-level and row-level permissions are now available through AWS Lake Formation, giving you governed access control at a granular level across your Iceberg tables, no custom access logic required.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Variant data type&lt;/h3&gt; 
&lt;p&gt;The new VARIANT data type, supported through Apache Iceberg V3, brings native support for semi-structured JSON data directly into Spark SQL. This matters most when you don’t control the data being written, because platform teams and shared services often receive data from partners and upstream teams with unpredictable or evolving structures.&lt;/p&gt; 
&lt;p&gt;Without VARIANT, handling semi-structured data meant accepting real tradeoffs: defining schemas upfront that broke when data evolved, storing everything as strings with heavy parsing costs on every read, or building wide tables with nullable columns that wasted storage on empty fields. The most realistic option was breaking nested structures apart into separate columns before running queries. This ETL step added latency, increased storage costs, and broke every time an upstream team added or removed fields from their data feed.&lt;/p&gt; 
&lt;p&gt;VARIANT eliminates that ETL step entirely. Data stays nested and is queryable with &lt;code&gt;variant_get()&lt;/code&gt;, without a separate ETL pipeline. You ingest without defining a schema first and apply structure at query time.&lt;/p&gt; 
&lt;p&gt;For example, querying nested fields is now a single expression:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;-- The third argument casts each extracted field to a concrete type
SELECT
    variant_get(payload, '$.user.name', 'string') AS user_name,
    variant_get(payload, '$.event.type', 'string') AS event_type,
    variant_get(payload, '$.event.timestamp', 'string') AS event_timestamp
FROM VALUES
    (PARSE_JSON('{"user":{"name":"Alice"},"event":{"type":"click","timestamp":"2025-03-01"}}'))
AS t(payload)
WHERE variant_get(payload, '$.event.timestamp', 'string') &amp;gt; '2025-01-01';&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;For a deep dive into how VARIANT is stored in Parquet, shredding mechanics, and a full end-to-end walkthrough on Amazon EMR Serverless, see &lt;a href="https://aws.amazon.com/blogs/big-data/beyond-json-blobs-implementing-the-variant-data-type-in-apache-iceberg-v3/" target="_blank" rel="noopener"&gt;Beyond JSON blobs: Implementing the VARIANT data type in Apache Iceberg V3&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;Key benefits for your workloads:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Reduced pipeline fragility:&lt;/strong&gt; Schema changes no longer break ingestion. Data lands as-is, and you apply structure at query time based on what each analysis needs, without upstream coordination.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Improved query performance:&lt;/strong&gt; Optimized storage format enables efficient access to nested fields without parsing overhead, so queries run faster even on deeply nested payloads.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Better storage efficiency:&lt;/strong&gt; Compact encoding eliminates the waste of NULL-heavy wide tables, reducing storage costs for semi-structured data at scale.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;VARIANT is especially well-suited where schemas are unpredictable or evolving: IoT and sensor data with device-specific payloads, logging and telemetry with variable event structures, and API responses and webhooks from third-party services where the schema changes without notice.&lt;/p&gt; 
&lt;h2&gt;SQL enhancements&lt;/h2&gt; 
&lt;p&gt;You can now write and maintain Spark pipelines using the same standard SQL you already know, with no Spark-specific functions or syntax required. Apache Spark 4.0 expands ANSI SQL compliance so that functions behave consistently, which opens Spark to anyone who can write SQL rather than only to Spark specialists.&lt;/p&gt; 
&lt;p&gt;Standard SQL syntax such as &lt;code&gt;OFFSET&lt;/code&gt;, &lt;code&gt;LIMIT ... OFFSET&lt;/code&gt;, and lateral column aliases now works as expected. For example:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;-- Standard OFFSET syntax now supported
SELECT id, name
FROM VALUES (1, 'Alice'), (2, 'Bob'), (3, 'Carol'), (4, 'Dave') AS t(id, name)
ORDER BY id
LIMIT 2 OFFSET 1;

-- Lateral column aliases work inline
SELECT amount * 1.1 AS adjusted, adjusted * 0.08 AS tax
FROM VALUES (100.0), (200.0), (350.0) AS t(amount);&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Beyond syntax, SQL scripting brings procedural logic directly into Spark SQL. You can now use variables, IF/ELSE conditionals, loops, and multi-statement blocks without switching to Python or JVM-based languages. Before SQL scripting, multi-step workflows (such as ETL logic with conditional branching or iterative data quality checks) required wrapping SQL statements in Python or Scala to handle control flow. SQL scripting removes that dependency. SQL-native teams can author and maintain these workflows entirely in SQL.&lt;/p&gt; 
&lt;p&gt;Key benefits:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Simplified ETL workflows:&lt;/strong&gt; Multi-step transformation logic that previously required an external language can now live entirely in SQL, reducing code complexity and making pipelines easier to build and maintain.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Lower barrier for SQL-native teams:&lt;/strong&gt; Teams that primarily work in SQL no longer need to context-switch into Python or Scala to implement conditional logic or iterative processing. The entire pipeline stays in SQL.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Python advances&lt;/h2&gt; 
&lt;p&gt;Earlier versions of Spark required Python users to step outside Python at two key points: building custom data connectors required Java or Scala, and diagnosing UDF performance required external tooling because Spark offered no built-in visibility. Spark 4.0 addresses both directly, removing the two biggest blockers for organizations where Python is the primary language.&lt;/p&gt; 
&lt;h3&gt;Python Data Source API&lt;/h3&gt; 
&lt;p&gt;With the Python Data Source API, you can build custom, reusable data connectors in Python without any JVM or Scala code. Custom connectors participate in Spark’s query optimization, including predicate pushdown and schema inference. This matters when your data system only has a Python client, or when your team does not have Java or Scala expertise: you can now wrap any custom format or external source as a Spark DataFrame source or sink without leaving Python.&lt;/p&gt; 
&lt;p&gt;Spark 4.0 also introduces polymorphic Python UDTFs (User-Defined Table Functions) that can return different schema shapes depending on input, with an &lt;code&gt;analyze()&lt;/code&gt; method that produces a schema dynamically based on parameters. This is particularly useful for processing varying JSON schemas or splitting inputs into a variable set of outputs.&lt;/p&gt; 
&lt;p&gt;If you’re ingesting data from a REST API with a Python client, you can implement a custom Spark data source entirely in Python, register it, and use it directly in Spark SQL or the DataFrame API. What previously required a Scala developer and a custom JVM connector can now be built and maintained by your Python team running the pipeline.&lt;/p&gt; 
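&lt;p&gt;The following is a minimal sketch of what such a connector could look like. The &lt;code&gt;fetch_orders&lt;/code&gt; function, the &lt;code&gt;orders_api&lt;/code&gt; format name, the endpoint URL, and the two-column schema are all hypothetical placeholders for your own Python client. The fallback classes in the &lt;code&gt;except&lt;/code&gt; branch exist only so the sketch can be exercised without a Spark installation.&lt;/p&gt; 

```python
try:
    # Python Data Source API base classes (PySpark 4.0 and later).
    from pyspark.sql.datasource import DataSource, DataSourceReader
except ImportError:
    # Minimal stand-ins so the sketch can be exercised without Spark installed.
    class DataSource:
        def __init__(self, options):
            self.options = options

    class DataSourceReader:
        pass


def fetch_orders(endpoint):
    """Hypothetical stand-in for a call made through your Python REST client."""
    return [
        {"order_id": 1, "status": "shipped"},
        {"order_id": 2, "status": "pending"},
    ]


class OrdersReader(DataSourceReader):
    def __init__(self, options):
        # Assumed option name; any key passed with .option(...) arrives here.
        self.endpoint = options.get("endpoint", "https://example.com/orders")

    def read(self, partition):
        # Yield one tuple per row, in the column order declared by schema().
        for record in fetch_orders(self.endpoint):
            yield (record["order_id"], record["status"])


class OrdersDataSource(DataSource):
    @classmethod
    def name(cls):
        # Short name used with spark.read.format(...).
        return "orders_api"

    def schema(self):
        return "order_id INT, status STRING"

    def reader(self, schema):
        return OrdersReader(self.options)


# With an active SparkSession named `spark`:
#   spark.dataSource.register(OrdersDataSource)
#   spark.read.format("orders_api").load().show()
```

&lt;p&gt;Keeping the client call in a separate function means the row-shaping logic in &lt;code&gt;read&lt;/code&gt; can be unit tested without a running cluster.&lt;/p&gt; 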
&lt;h3&gt;Python UDF enhancement&lt;/h3&gt; 
&lt;p&gt;Python UDF profiling provides built-in visibility into execution time, serialization overhead, and memory usage at the individual UDF level without external tooling. Additionally, it enables performance or memory profiling depending on what you need to diagnose.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Arrow-based vectorized UDF support&lt;/strong&gt; reduces serialization overhead between Python and the JVM using a columnar format, replacing row-at-a-time processing with batch-oriented columnar exchange.&lt;/p&gt; 
&lt;p&gt;Together, these give you a complete optimization loop: build custom connectors in Python, profile your UDF performance, and optimize with confidence.&lt;/p&gt; 
&lt;p&gt;Practical benefits for Python teams:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Lower barrier for Python teams:&lt;/strong&gt; Custom data connectors no longer require Java or Scala knowledge. If your data system has a Python client, you can build a production-grade Spark connector entirely in Python.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Flexible data transformation:&lt;/strong&gt; Polymorphic UDTFs let your functions adapt to varying input schemas dynamically, reducing the need to write and maintain multiple transformation functions for different data shapes.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Faster UDF optimization:&lt;/strong&gt; Built-in profiling surfaces exactly where execution time and memory are being spent at the UDF level, replacing guesswork with direct visibility and making performance tuning significantly faster.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Streaming enhancements&lt;/h2&gt; 
&lt;p&gt;This section covers improvements to state management and observability in structured streaming workloads.&lt;/p&gt; 
&lt;h3&gt;Queryable state for structured streaming&lt;/h3&gt; 
&lt;p&gt;Structured streaming jobs maintain state continuously (running totals, session windows, aggregated counts). However, in earlier versions of Spark that state was locked inside the running job. Inspecting it meant stopping the stream or manually parsing checkpoint files. For production workloads, this created real operational risk: teams had no way to verify whether state was correct, corrupted, or drifting without taking the job down.&lt;/p&gt; 
&lt;p&gt;Time-sensitive applications faced an additional problem: timers in Spark streaming only fired when new data arrived, so a five-minute heartbeat timeout could silently miss its window if no data came in, making applications like heartbeat monitoring and session tracking unreliable by design.&lt;/p&gt; 
&lt;p&gt;Spark 4.0 changes this. The new &lt;code&gt;transformWithState&lt;/code&gt; API provides deterministic timer execution because timers fire on schedule regardless of data arrival patterns. It also delivers automatic state TTL to prevent unbounded growth, schema evolution without restarting from a new checkpoint, and state observability for mid-stream debugging. External systems can now read live aggregated state from a running streaming job without interrupting it. State is accessible as a DataFrame, queryable during development, verifiable in unit tests, and inspectable during production incidents without touching the running stream.&lt;/p&gt; 
&lt;p&gt;This is backed by three improvements working together. First, the &lt;code&gt;transformWithState&lt;/code&gt; operator replaces &lt;code&gt;mapGroupsWithState&lt;/code&gt; from earlier Spark versions (which had limited timer support and no TTL-based cleanup). Second, the state data source reader exposes streaming state as a queryable DataFrame. Third, RocksDB changelog checkpointing improvements address throughput bottlenecks in high-volume stateful workloads.&lt;/p&gt; 
&lt;p&gt;Consider a fleet of 100,000 IoT sensors across manufacturing facilities, each requiring an alert within 30 seconds of going offline. Tracking heartbeat state per device, managing independent timers, handling late data, and cleaning up decommissioned devices at scale had no clean solution in earlier Spark versions. The &lt;code&gt;transformWithState&lt;/code&gt; operator handles all of this natively, and queryable state lets your operations team inspect live device state in real time without stopping the stream:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
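 &lt;pre&gt;&lt;code class="language-python"&gt;from pyspark.sql.types import StructType, StructField, StringType

# HeartbeatMonitor is a user-defined StatefulProcessor (not shown here).
# Timers fire on schedule regardless of data arrival, making heartbeat monitoring reliable
alerts = events_df.groupBy("device_id").transformWithState(
    HeartbeatMonitor(),
    outputStructType=StructType([
        StructField("device_id", StringType()),
        StructField("alert", StringType())
    ]),
    outputMode="Append",
    timeMode="ProcessingTime"  # required argument; processing-time timers do not wait for new input
)&lt;/code&gt;&lt;/pre&gt; 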
&lt;/div&gt; 
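&lt;p&gt;To see why the timer behavior matters, here is a toy model in plain Python. It is not the Spark API: the &lt;code&gt;fired_timeouts&lt;/code&gt; function and the timestamps are illustrative assumptions that contrast timers checked only when a record arrives with timers checked on a regular clock tick.&lt;/p&gt; 

```python
def fired_timeouts(last_heartbeat, timeout_s, evaluation_times):
    """Return (device_id, fire_time) pairs.

    A timeout fires at the first evaluation time at or after the deadline,
    so the result depends on how often timers are evaluated.
    """
    alerts = []
    for device_id, last_seen in sorted(last_heartbeat.items()):
        deadline = last_seen + timeout_s
        due = [t for t in evaluation_times if t >= deadline]
        if due:
            alerts.append((device_id, min(due)))
    return alerts


last_heartbeat = {"sensor-a": 0}      # sensor-a last reported at t=0 seconds
timeout_s = 30

# Data-driven evaluation: timers are checked only when a record arrives.
arrival_times = [0, 95]               # the stream is quiet between t=0 and t=95
data_driven = fired_timeouts(last_heartbeat, timeout_s, arrival_times)

# Clock-driven evaluation: timers are checked every 5 seconds regardless of input.
clock_ticks = list(range(0, 100, 5))
clock_driven = fired_timeouts(last_heartbeat, timeout_s, clock_ticks)

print(data_driven)    # [('sensor-a', 95)]
print(clock_driven)   # [('sensor-a', 30)]
```

&lt;p&gt;In this model the data-driven alert arrives 65 seconds late because nothing triggers the check while the stream is quiet, whereas the clock-driven check raises it at the 30-second deadline.&lt;/p&gt; 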
&lt;p&gt;Combined with Amazon EMR Serverless, which scales compute automatically based on workload demands, you can deploy stateful streaming pipelines without managing clusters or predicting capacity.&lt;/p&gt; 
&lt;p&gt;Benefits:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Real-time operational visibility:&lt;/strong&gt; Live streaming state is now accessible externally without interrupting the job, powering dashboards and monitoring systems that reflect current aggregations.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Faster debugging:&lt;/strong&gt; State values can be queried directly as a DataFrame, making it significantly easier to diagnose production incidents and verify correctness during development.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Better performance at scale:&lt;/strong&gt; RocksDB checkpointing improvements reduce bottlenecks in high-throughput stateful workloads, improving reliability for long-running streaming jobs.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;What’s new in the emr-spark-8.0 release&lt;/h2&gt; 
&lt;p&gt;Beyond the Spark 4.0 capabilities covered in the preceding sections, the emr-spark-8.0 release introduces infrastructure and runtime changes that simplify how you deploy, patch, and manage Amazon EMR workloads. The release focuses exclusively on Spark, reducing the surface area you need to patch and test.&lt;/p&gt; 
&lt;h3&gt;Fewer components to patch and test&lt;/h3&gt; 
&lt;p&gt;The &lt;code&gt;emr-spark-8.0&lt;/code&gt; release includes Apache Spark 4.0, Apache Iceberg 1.10, Apache Hudi 1.0.2, Delta Lake 4.0, and connectors for Amazon DynamoDB, Amazon Kinesis, Amazon Redshift, and Amazon Simple Storage Service (Amazon S3) (via the S3A connector). Apache Livy and JupyterEnterpriseGateway are available as opt-in components on Amazon EMR on EC2. If your workloads require Apache Flink, Trino, Presto, or other execution engines, you can continue to use Amazon EMR 7.x releases.&lt;/p&gt; 
&lt;h3&gt;Simplified patch management&lt;/h3&gt; 
&lt;p&gt;You can specify &lt;code&gt;emr-spark-8.0.x&lt;/code&gt; when creating a cluster or application, and Amazon EMR automatically selects the latest patch version (for example, &lt;code&gt;emr-spark-8.0.1&lt;/code&gt;, &lt;code&gt;emr-spark-8.0.2&lt;/code&gt;, and so on as patches are released). This “.x” wildcard is supported through AWS APIs and the AWS Command Line Interface (AWS CLI). On Amazon EMR on EKS and Amazon EMR Serverless, new jobs automatically run on the latest Amazon Linux patches, so you no longer need to track date-based version tags.&lt;/p&gt; 
&lt;h3&gt;Latest Python, Java, and Scala runtimes&lt;/h3&gt; 
&lt;p&gt;The release ships with modernized runtimes: Python 3.11 as the default, with support for Python 3.12 and 3.13. Java 17 is the default, with Java 21 also available. Both are provided through &lt;a href="https://aws.amazon.com/corretto/" target="_blank" rel="noopener"&gt;Amazon Corretto&lt;/a&gt;. Scala 2.13 is the supported Scala runtime.&lt;/p&gt; 
&lt;p&gt;A few infrastructure changes to note: AWS SDK for Java v2 replaces v1, bringing improved performance and alignment with the latest AWS APIs. The EMR S3A connector replaces EMR File Systems (EMRFS) for Amazon S3 access, delivering better performance and compatibility with open-source Spark. For shuffle-intensive workloads on Amazon EMR Serverless, enabling Serverless Storage can reduce data processing costs by up to 20%. For benchmarks and more information, see &lt;a href="https://aws.amazon.com/blogs/big-data/optimize-amazon-emr-runtime-for-apache-spark-with-emr-s3a/" target="_blank" rel="noopener"&gt;Optimize Amazon EMR Runtime for Apache Spark with EMR S3A&lt;/a&gt;, &lt;a href="https://aws.amazon.com/blogs/big-data/amazon-emr-serverless-eliminates-local-storage-provisioning-reducing-data-processing-costs-by-up-to-20/" target="_blank" rel="noopener"&gt;Amazon EMR Serverless eliminates local storage provisioning&lt;/a&gt;, and &lt;a href="https://aws.amazon.com/blogs/big-data/reducing-costs-for-shuffle-heavy-apache-spark-workloads-with-serverless-storage-for-amazon-emr-serverless/" target="_blank" rel="noopener"&gt;Reducing costs for shuffle-heavy Apache Spark workloads with serverless storage for Amazon EMR Serverless&lt;/a&gt;.&lt;/p&gt; 
&lt;h3&gt;Migration and compatibility notes&lt;/h3&gt; 
&lt;p&gt;If you are migrating from Spark 3.5 to Spark 4.0, the &lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/spark-upgrades.html" target="_blank" rel="noopener"&gt;Apache Spark upgrade agent for Amazon EMR&lt;/a&gt; can accelerate your migration by analyzing existing applications and identifying changes needed for Spark 4.0 compatibility. For more information, &lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-upgrade-agent-setup.html" target="_blank" rel="noopener"&gt;see the upgrade guidance&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;If your workflows use Apache Pig, Apache Oozie, JupyterHub, Apache Zeppelin, or Hue, you can continue to use Amazon EMR 7.x releases. These components are not included in &lt;code&gt;emr-spark-8.0&lt;/code&gt;. For interactive Spark development, use Amazon EMR Studio, with Apache Livy and JupyterEnterpriseGateway available on Amazon EMR on EC2.&lt;/p&gt; 
&lt;p&gt;For the complete list of supported components and configurations, see the &lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-whatsnew.html" target="_blank" rel="noopener"&gt;Amazon EMR release guide&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;Get started&lt;/h2&gt; 
&lt;p&gt;Spark 4.0 is now available across Amazon EMR on EC2, Amazon EMR on EKS, and Amazon EMR Serverless. To begin, choose your deployment model and follow the relevant getting started guide:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Amazon EMR on EC2 — &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-gs.html" target="_blank" rel="noopener"&gt;Getting started with Amazon EMR on EC2&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;Amazon EMR Serverless — &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/setting-up.html" target="_blank" rel="noopener"&gt;Getting started with Amazon EMR Serverless&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;Amazon EMR on EKS — &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-on-EKS-DevelopmentGuide/getting-started.html" target="_blank" rel="noopener"&gt;Getting started with Amazon EMR on EKS&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;Amazon EMR release guide — &lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark.html" target="_blank" rel="noopener"&gt;Apache Spark&lt;/a&gt;.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;Spark 4.0 on Amazon EMR delivers improvements across query validation, semi-structured data handling, Python development, and streaming observability. ANSI SQL mode catches invalid operations at query time rather than silently propagating nulls downstream, and SQL scripting removes the need to context-switch between SQL and Python for complex ETL logic. The VARIANT data type eliminates parsing overhead for semi-structured JSON workloads and can now be stored natively in Iceberg V3 tables with fine-grained access control at the column and row level. Queryable streaming state gives you live visibility into running jobs without interruption, and Spark Connect lets you develop against Amazon EMR from Jupyter notebooks, Amazon SageMaker Unified Studio Data Notebooks, Amazon Q Developer, Kiro, or your preferred IDE without managing cluster connectivity.&lt;/p&gt; 
&lt;p&gt;Ready to build or migrate? Choose your deployment model from the preceding section and get started today. For detailed guidance, see the &lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-whatsnew.html" target="_blank" rel="noopener"&gt;Amazon EMR Release Guide&lt;/a&gt; and the &lt;a href="https://docs.aws.amazon.com/emr/latest/EMR-Serverless-UserGuide/emr-serverless.html" target="_blank" rel="noopener"&gt;Amazon EMR Serverless User Guide&lt;/a&gt;.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/01/21/suthan_photo.png" alt="Suthan Phillips" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Suthan Phillips&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/suthan-phillips/" target="_blank" rel="noopener"&gt;Suthan&lt;/a&gt; is a Senior Analytics Architect at AWS, where he helps customers design and optimize scalable, high-performance data solutions that drive business insights. He combines architectural guidance on system design and scalability with best practices to provide efficient, secure implementation across data processing and experience layers. Outside of work, Suthan enjoys swimming, hiking, and exploring the Pacific Northwest.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/20/bdb5882a3.png" alt="Karthik Prabhakar" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Karthik Prabhakar&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/karthik-prabhakar/" target="_blank" rel="noopener"&gt;Karthik&lt;/a&gt; is a Data Processing Engines Architect for Amazon EMR at AWS. He specializes in distributed systems architecture and query optimization, working with customers to solve complex performance challenges in large-scale data processing workloads. His focus spans engine internals, cost optimization strategies, and architectural patterns that enable customers to run petabyte-scale analytics efficiently.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/20/bdb5882a2.jpg" alt="Kiki Nwangwu" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Kiki Nwangwu&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/nkeirunwangwuokafor/" target="_blank" rel="noopener"&gt;Kiki&lt;/a&gt; is an Analytics and GenAI Specialist Solutions Architect at AWS. She specializes in helping customers architect, build, and modernize scalable data analytics and generative AI solutions. She enjoys traveling and exploring new cultures.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2020/12/23/Al-MS.jpg" alt="Al MS" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Al MS&lt;/h3&gt; 
  &lt;p&gt;Al is a product manager for Amazon EMR at Amazon Web Services.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		<enclosure length="24068829" type="video/mp4" url="https://d2908q01vomqb2.cloudfront.net/artifacts/DBSBlogs/BDB-5882/emrsparkconnectvariantdemo.mp4"/>

			</item>
		<item>
		<title>Unlock cost savings with incremental snapshot billing for Amazon Redshift Serverless and Amazon Redshift RG</title>
		<link>https://aws.amazon.com/blogs/big-data/unlock-cost-savings-with-incremental-snapshot-billing-for-amazon-redshift-serverless-and-amazon-redshift-rg/</link>
					
		
		<dc:creator><![CDATA[Nidhi Nayak]]></dc:creator>
		<pubDate>Mon, 08 Jun 2026 17:24:53 +0000</pubDate>
				<category><![CDATA[Amazon Redshift]]></category>
		<category><![CDATA[Announcements]]></category>
		<guid isPermaLink="false">fef7d8b93870abd48b0443643b2184c84ff25935</guid>

					<description>Starting June 8, 2026, Amazon Redshift is introducing an incremental snapshot billing model for Amazon Redshift Serverless and Amazon Redshift RG (provisioned instances powered by AWS Graviton). With this enhancement, you pay only for the unique data blocks across your active manual snapshots within your account. This delivers significant cost savings for customers who have multiple snapshots that contain largely identical data blocks. In this post, you will learn how the new incremental snapshot billing model works, the customer use cases it addresses, and how it helps you optimize costs while improving your Recovery Point Objective (RPO).</description>
										<content:encoded>&lt;p&gt;&lt;a href="https://aws.amazon.com/redshift/?trk=d8ec3b19-0f37-4f8c-8c12-189f913e205c&amp;amp;sc_channel=el" target="_blank" rel="noopener"&gt;Amazon Redshift&lt;/a&gt; customers rely heavily on &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-snapshots.html" target="_blank" rel="noopener"&gt;snapshots&lt;/a&gt;, which are point-in-time backups of their data, for disaster recovery, compliance retention, and data portability across AWS Regions. Amazon Redshift supports two types of snapshots: automated and manual. For &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/overview.html" target="_blank" rel="noopener"&gt;provisioned clusters&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-snapshots.html#about-automated-snapshots" target="_blank" rel="noopener"&gt;automated snapshots&lt;/a&gt; are enabled by default and retained for up to 35 days; &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-snapshots.html#about-manual-snapshots" target="_blank" rel="noopener"&gt;manual snapshots&lt;/a&gt; persist until you delete them. For &lt;a href="https://aws.amazon.com/redshift/redshift-serverless/" target="_blank" rel="noopener"&gt;serverless&lt;/a&gt; workgroups, Amazon Redshift automatically creates &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-snapshots-recovery-points.html" target="_blank" rel="noopener"&gt;recovery points&lt;/a&gt; that are retained for 24 hours, and you can also create manual snapshots with a configurable retention period. For details on snapshot creation and backup storage pricing, you can refer to &lt;a href="https://aws.amazon.com/redshift/pricing/" target="_blank" rel="noopener"&gt;Amazon Redshift pricing&lt;/a&gt; for more details.&lt;/p&gt; 
&lt;p&gt;Starting June 8, 2026, Amazon Redshift is introducing an incremental snapshot billing model for &lt;a href="https://aws.amazon.com/redshift/redshift-serverless/" target="_blank" rel="noopener"&gt;Amazon Redshift Serverless&lt;/a&gt; and &lt;a href="https://aws.amazon.com/redshift/features/rg/?trk=d8ec3b19-0f37-4f8c-8c12-189f913e205c&amp;amp;sc_channel=el" target="_blank" rel="noopener"&gt;Amazon Redshift RG (provisioned instances powered by AWS Graviton)&lt;/a&gt;. With this enhancement, you pay only for the unique data blocks across your active manual snapshots within your account. This delivers significant cost savings for customers who have multiple snapshots that contain largely identical data blocks.&lt;/p&gt; 
&lt;p&gt;In this post, you will learn how the new incremental snapshot billing model works, the customer use cases it addresses, and how it helps you optimize costs while improving your Recovery Point Objective (RPO).&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Incremental snapshot billing&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;With this new billing model, Amazon Redshift bills manual snapshots based on unique data blocks. When you take multiple manual snapshots of the same workgroup or cluster, much of the data remains unchanged between snapshots. The billing model recognizes this overlap and charges only for the unique data blocks across your active snapshots. Data that has not changed between snapshots is counted once.&lt;/p&gt; 
&lt;p&gt;Consider a 10 TB data warehouse with three manual snapshots:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Snapshot 1 (Day 1):&lt;/strong&gt; Full backup, 10 TB of unique data blocks&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Snapshot 2 (seconds later):&lt;/strong&gt; Nothing changed, shares data blocks with Snapshot 1, no additional charge&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Snapshot 3 (two days later):&lt;/strong&gt; 1 TB of new unique data blocks created from changes&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Total billed: 11 TB of unique data blocks&lt;/strong&gt;&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;In this example, customers pay for the 10 TB of unique data blocks in Snapshot 1 plus the 1 TB of new unique data blocks in Snapshot 3. Snapshot 2 shares its blocks with Snapshot 1, so it adds zero cost. In total, 11 TB of unique data blocks are billed.&lt;/p&gt; 
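&lt;p&gt;The same counting can be sketched in a few lines of Python. This is a toy model, not how Amazon Redshift tracks blocks internally: each snapshot is represented as a set of made-up block IDs, with one ID standing in for 1 TB of data.&lt;/p&gt; 

```python
# Toy model: each snapshot is the set of data-block IDs it references.
# Block IDs and the one-block-per-TB granularity are illustrative only.
baseline = {f"base-{i}" for i in range(10)}    # 10 TB present on Day 1
changed = {"new-0"}                            # 1 TB of new blocks by Day 3

snapshots = {
    "snapshot-1": baseline,
    "snapshot-2": baseline,                    # taken seconds later, nothing changed
    "snapshot-3": baseline | changed,          # taken two days later
}


def billed_tb(snaps):
    """Size in TB of the union of blocks across all active snapshots."""
    unique_blocks = set().union(*snaps.values())
    return len(unique_blocks)


total_tb = billed_tb(snapshots)
cumulative_tb = sum(len(blocks) for blocks in snapshots.values())

print(total_tb)        # 11 TB billed, counting shared blocks once
print(cumulative_tb)   # 31 TB if every snapshot were counted in full
```

&lt;p&gt;Because billing follows the union of blocks, deleting Snapshot 1 in this model would not lower the total while Snapshot 2 still references the same blocks.&lt;/p&gt; 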
&lt;p&gt;&lt;strong&gt;Key billing model details&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;With the new incremental snapshot billing model, you are charged only for the unique data blocks at the existing snapshot rates. Following are the key details of the new feature:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Scope: &lt;/strong&gt;Amazon Redshift Serverless and Amazon Redshift RG instances. Amazon Redshift RA3 instances retain the current tiered S3 billing.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Rate:&lt;/strong&gt; Based on the existing snapshot pricing for your Region.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Deduplication level: &lt;/strong&gt;Account-level for Amazon Redshift Serverless and RG.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Automated snapshots: &lt;/strong&gt;Unchanged, still available at no additional cost (35 days for Provisioned, 1 day for Serverless).&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Existing snapshots: &lt;/strong&gt;Automatically transition to the incremental snapshot billing model. No action required.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;This model is especially valuable for customers needing backup retention beyond the automated snapshot windows available at no additional cost. Serverless customers needing backup beyond 24 hours can now take manual snapshots knowing they pay only for unique data blocks, making extended retention more practical and affordable.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Benefits &lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;With the incremental snapshot billing model, customers can adopt stronger data protection strategies at optimized costs:&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Compliance-driven long-term retention&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Regulated industries (financial services, healthcare, government, and life sciences) must often retain backups for 90 days to 5+ years. Since this billing model charges only for unique data blocks, retention policies become significantly more affordable as snapshots accumulate.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;How this feature helps:&lt;/strong&gt; You can now maintain backup retention (90-day, 1-year, 7-year) on Amazon Redshift Serverless and RG at optimized cost. A 10 TB warehouse with 5% daily change rate retaining 90 days of daily snapshots pays for ~14.5 TB of unique data blocks total across all snapshots.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Disaster recovery with better Recovery Point and Time Objectives (RPO/RTO)&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Many customers want more frequent snapshots (hourly instead of daily) for tighter recovery objectives. Because each additional snapshot is billed only for its new unique data blocks, frequent backups are practical and affordable.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;How this feature helps: &lt;/strong&gt;You can take hourly snapshots where each one adds only ~0.2% in new unique data (assuming 5% daily change rate). More snapshots mean more recovery points and less data loss in a failure scenario, all at optimized cost.&lt;/p&gt; 
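&lt;p&gt;The per-snapshot figure follows from dividing the daily change rate by the number of snapshots per day. The short sketch below reproduces it, assuming changes are spread evenly across the day and using an illustrative 10 TB warehouse; the function name is hypothetical.&lt;/p&gt; 

```python
def increment_per_snapshot(daily_change_rate, snapshots_per_day=24):
    """Fraction of the dataset that is new in each snapshot.

    Assumes changes are spread evenly across the day.
    """
    return daily_change_rate / snapshots_per_day


increment = increment_per_snapshot(0.05)       # 5% daily change, hourly snapshots
warehouse_gb = 10 * 1024                       # illustrative 10 TB warehouse
new_gb_per_snapshot = warehouse_gb * increment

print(f"{increment:.2%} of the dataset, about {new_gb_per_snapshot:.1f} GB per hourly snapshot")
```

&lt;p&gt;Real workloads rarely change at a uniform rate, so treat the result as a planning estimate rather than a billing prediction.&lt;/p&gt; 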
&lt;h3&gt;&lt;strong&gt;Cross-Region disaster recovery at lower cost&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Snapshots copied to another Region for disaster recovery are also billed based on unique data blocks. Organizations maintaining multi-Region disaster recovery (DR) strategies pay in proportion to actual data changes, making geographic redundancy affordable.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;How this feature helps:&lt;/strong&gt; If you are running active-passive or active-active multi-Region architectures, you can copy snapshots across Regions more frequently, improving cross-Region RPO while keeping DR costs proportional to actual data changes rather than full dataset size.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Affordable extended backups &lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;With the incremental snapshot billing model, extended manual backups are more affordable for customers, regardless of their workload size. Even short retention policies (7-day, 14-day) cost in proportion to actual data changes, which strengthens data protection posture across the board.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;How this feature helps: &lt;/strong&gt;Customers no longer need to choose between data protection and budget. This billing model helps make extended retention cost effective for workloads of varying sizes.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Pricing example&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Suppose you have an Amazon Redshift Serverless workgroup with 10 TB of active data in US East (Ohio). You take daily manual snapshots with 7-day retention, and your data changes at 5% per day (0.5 TB/day).&lt;/p&gt; 
&lt;table border="1px" width="584" cellpadding="5"&gt; 
 &lt;thead&gt; 
  &lt;tr&gt; 
   &lt;td width="168"&gt;&lt;strong&gt;Component&lt;/strong&gt;&lt;/td&gt; 
   &lt;td width="264"&gt;&lt;strong&gt;Calculation&lt;/strong&gt;&lt;/td&gt; 
   &lt;td width="152"&gt;&lt;strong&gt;Monthly Cost&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td width="168"&gt;Active data&lt;/td&gt; 
   &lt;td width="264"&gt;10 TB × 1,024 GB/TB × $0.023&lt;/td&gt; 
   &lt;td width="152"&gt;$235.52&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td width="168"&gt;Unique snapshot blocks (after deduplication)&lt;/td&gt; 
   &lt;td width="264"&gt;13 TB × 1,024 GB/TB × $0.023&lt;/td&gt; 
   &lt;td width="152"&gt;$306.18&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td width="168"&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt; 
   &lt;td width="264"&gt;&lt;/td&gt; 
   &lt;td width="152"&gt;&lt;strong&gt;$541.70&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;Because shared blocks across snapshots are counted only once, you pay for 13 TB of unique snapshot data rather than the full cumulative size of all seven daily snapshots.&lt;/p&gt; 
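&lt;p&gt;The table can be reproduced with a short calculation. The sketch below assumes one reading of how the 13 TB figure arises: the first retained snapshot holds the full 10 TB, and each of the other six adds one day of changes (0.5 TB). The $0.023 per GB-month rate is the one used in the table; check the pricing page for your Region.&lt;/p&gt; 

```python
from decimal import Decimal, ROUND_HALF_UP

GB_PER_TB = 1024
RATE_PER_GB_MONTH = Decimal("0.023")    # rate used in the table above


def monthly_cost(tb):
    """Monthly storage cost in USD for `tb` terabytes, rounded to cents."""
    cost = Decimal(tb) * GB_PER_TB * RATE_PER_GB_MONTH
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


active_tb = 10
daily_change_tb = Decimal("0.5")        # 5% of 10 TB
retained_snapshots = 7

# Assumption: the first snapshot holds the full 10 TB and each of the
# remaining six adds one day of changed blocks.
unique_snapshot_tb = active_tb + daily_change_tb * (retained_snapshots - 1)

active_cost = monthly_cost(active_tb)
snapshot_cost = monthly_cost(unique_snapshot_tb)
total_cost = active_cost + snapshot_cost

print(unique_snapshot_tb)   # 13.0
print(active_cost)          # 235.52
print(snapshot_cost)        # 306.18
print(total_cost)           # 541.70
```

&lt;p&gt;Using &lt;code&gt;Decimal&lt;/code&gt; rather than floating point keeps the rounding to cents predictable.&lt;/p&gt; 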
&lt;h3&gt;&lt;strong&gt;Compounding savings on Amazon Redshift RG&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;If you are evaluating migrating from RA3 to RG, the savings stack significantly. Some of the compounding savings on RG include:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;RG instances are priced at a 30% discount compared to RA3 instances.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/purchase-reserved-node-instance.html" target="_blank" rel="noopener"&gt;Reserved Instances&lt;/a&gt; (RI) pricing is available for RG, which provides further compute savings.&lt;/li&gt; 
 &lt;li&gt;Incremental billing removes duplicate snapshot charges for backup storage.&lt;/li&gt; 
 &lt;li&gt;Data lake queries are included in RG compute pricing, thereby avoiding the per-terabyte scanning charges of Amazon Redshift Spectrum.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Combined, these options for RG can deliver an aggregate cost reduction of more than 30% compared to RA3. You can lock in RI pricing on RG clusters for predictable, long-term savings on top of the incremental snapshot benefit.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Getting started&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;No action is required on your end. Your existing manual snapshots automatically transition to the incremental snapshot billing model on June 8, 2026.&lt;/p&gt; 
&lt;p&gt;To maximize the benefit:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Review your current snapshot usage&lt;/strong&gt; in the &lt;a href="https://console.aws.amazon.com/costmanagement/" target="_blank" rel="noopener"&gt;AWS Billing and Cost Management console&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Increase snapshot frequency&lt;/strong&gt;. More frequent snapshots now cost proportionally less since each additional snapshot only adds its unique data blocks to your bill.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Extend retention policies&lt;/strong&gt;. Compliance-driven retention (90-day, 1-year, 7-year) is now significantly more affordable.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Evaluate RA3 to RG migration&lt;/strong&gt;. When you evaluate migrating from RA3 to RG, consider the 30% compute savings combined with RI eligibility.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Explore Serverless&lt;/strong&gt;. The enhanced billing model makes Serverless a cost-effective option for customers who need backup retention beyond the 24-hour automated recovery point window.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;The incremental snapshot billing model for Amazon Redshift Serverless and Amazon Redshift RG charges only for unique data blocks across your snapshots. This supports more frequent snapshots for better disaster recovery, affordable long-term compliance retention, and a compelling path to Amazon Redshift Serverless adoption. Combined with Amazon Redshift RG’s 30% compute discount and Reserved Instances, this delivers meaningful total cost savings across your entire Amazon Redshift spend.&lt;/p&gt; 
&lt;p&gt;Review your snapshot strategy today and share your feedback on &lt;a href="https://repost.aws/" target="_blank" rel="noopener"&gt;AWS re:Post&lt;/a&gt;. For full pricing details, visit the &lt;a href="https://aws.amazon.com/redshift/pricing/" target="_blank" rel="noopener"&gt;Amazon Redshift pricing page&lt;/a&gt;.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h3&gt;About the authors&lt;/h3&gt; 
&lt;p style="clear: both"&gt;&lt;strong&gt;&lt;img loading="lazy" class="alignleft size-thumbnail wp-image-91891" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/05/6004a2-100x130.jpg" alt="" width="100" height="130"&gt;Nidhi Nayak&lt;/strong&gt;&lt;br&gt; &lt;a href="https://www.linkedin.com/in/nidhi-nayak-51511b21/" target="_blank" rel="noopener"&gt;Nidhi&lt;/a&gt; is a Senior Technical Account Manager with AWS, she helps enterprise customers build scalable, high-performance cloud applications and optimize cloud operations. With over a decade of experience in Data Analytics, Nidhi currently focuses on Redshift &amp;amp; Generative AI integration with Redshift.&lt;/p&gt; 
&lt;p style="clear: both"&gt;&lt;strong&gt;&lt;img loading="lazy" class="alignleft size-thumbnail wp-image-91890" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/05/6004a3-100x133.jpg" alt="" width="100" height="133"&gt;Raza Hafeez&lt;/strong&gt;&lt;br&gt; &lt;a href="https://www.linkedin.com/in/raza-hafeez/" target="_blank" rel="noopener"&gt;Raza&lt;/a&gt; is a Senior Product Manager, Technical at Amazon Redshift. He has 15+ years of experience building and optimizing enterprise data warehouses and is passionate about making cloud analytics accessible and cost-effective for customers of all sizes.&lt;/p&gt; 
&lt;p style="clear: both"&gt;&lt;strong&gt;&lt;img loading="lazy" class="alignleft size-thumbnail wp-image-91889" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/05/6004a1-100x100.jpg" alt="" width="100" height="100"&gt;Sushmita Barthakur&lt;/strong&gt;&lt;br&gt; &lt;a href="https://www.linkedin.com/in/sushmita-barthakur/" target="_blank" rel="noopener"&gt;Sushmita&lt;/a&gt; is a Senior Data Solutions Architect at AWS, supporting Strategic customers architect their data workloads on AWS. With a background in data analytics, she has extensive experience helping customers architect and build enterprise data lakes, ETL workloads, data warehouses and data analytics solutions, both on-premises and the cloud. Sushmita is based in Florida and enjoys traveling, reading and playing tennis.&lt;/p&gt; 
&lt;p style="clear: both"&gt;&lt;strong&gt;&lt;img loading="lazy" class="alignleft size-thumbnail wp-image-91888" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/05/6004a4-100x133.jpg" alt="" width="100" height="133"&gt;Amy Huang&lt;/strong&gt;&lt;br&gt; &lt;a href="https://www.linkedin.com/in/huangyaqian/" target="_blank" rel="noopener"&gt;Amy&lt;/a&gt; is a Senior Financial Analyst at AWS and a CPA with over 7 years of progressive experience across Strategic Finance, Banking, and Auditing. She specializes in pricing, financial modeling and valuation, and data-driven analysis. Outside of work, she enjoys yoga and hiking.&lt;/p&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Migrate JMS applications to Amazon MQ for RabbitMQ with minimal changes</title>
		<link>https://aws.amazon.com/blogs/big-data/migrate-jms-applications-to-amazon-mq-for-rabbitmq-with-minimal-changes/</link>
		
		<dc:creator><![CDATA[Vinodh Kannan Sadayamuthu]]></dc:creator>
		<pubDate>Mon, 08 Jun 2026 15:52:03 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon MQ]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">957fae46e8aadf2fbc159e22ab5bbc9728ccf4e8</guid>

					<description>This post shows you how to migrate your JMS applications and walks through a complete setup, from creating the broker to sending and receiving messages. You will also see a real-world scenario: migrating an existing Apache ActiveMQ workload to an Amazon MQ broker running RabbitMQ. The post covers configuration changes, monitoring with Amazon CloudWatch, and validation steps to make sure that your migration succeeds.</description>
										<content:encoded>&lt;p&gt;Running JMS applications on on-premises brokers or Apache ActiveMQ requires manual patching cycles, capacity planning for peak loads, and maintaining high availability across multiple data centers. With &lt;a href="https://docs.aws.amazon.com/amazon-mq/latest/developer-guide/rabbitmq-4.html" target="_blank" rel="noopener"&gt;Amazon MQ version 4&lt;/a&gt; and above, you can migrate your existing JMS applications without rewriting your messaging layer, removing weeks of rewrite work.&lt;/p&gt; 
&lt;p&gt;This post shows you how to migrate your JMS applications and walks through a complete setup, from creating the broker to sending and receiving messages. You will also see a real-world scenario: migrating an existing Apache ActiveMQ workload to an Amazon MQ broker running RabbitMQ. The post covers configuration changes, monitoring with Amazon CloudWatch, and validation steps to make sure that your migration succeeds.&lt;/p&gt; 
&lt;p&gt;Amazon MQ for RabbitMQ version 4 and above includes built-in support for the RabbitMQ JMS Client and the JMS Topic Exchange plugin. The two work together so that your existing JMS applications can connect using familiar JMS APIs. You update the connection factory configuration and broker endpoint. Your business logic, message producers, consumers, and listeners stay exactly as written.&lt;/p&gt; 
&lt;h2&gt;Understanding JMS and AMQP&lt;/h2&gt; 
&lt;h3&gt;How the RabbitMQ JMS Client works&lt;/h3&gt; 
&lt;p&gt;Use the RabbitMQ JMS Client to connect your Java application to Amazon MQ. The client translates your JMS API calls (javax.jms or jakarta.jms) into AMQP 0-9-1 messages that the broker understands.&lt;/p&gt; 
&lt;p&gt;Advanced Message Queuing Protocol (AMQP) defines how messages are formatted and transmitted across the network at the wire level. This means that non-Java services can consume the same messages using native AMQP clients, making the protocol language-agnostic.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/23/BDB-5769-1.png" alt="Architecture diagram showing JMS to AMQP translation: Java and Spring applications use the JMS API and RabbitMQ JMS Client Library to communicate with an Amazon MQ broker running RabbitMQ via the AMQP 0-9-1 protocol."&gt;&lt;/p&gt; 
&lt;h3&gt;JMS version support&lt;/h3&gt; 
&lt;p&gt;Migrate at the JMS version that your application already uses. The client supports JMS 1.1, 2.0, and 3.1 (Jakarta Messaging), so you don’t need to upgrade your application code before migrating brokers. The client integrates with Spring Framework and Spring Boot applications without requiring custom bean factories or application context configuration.&lt;/p&gt; 
&lt;p&gt;Because the JMS abstraction layer sits between your application and the broker, most migrations require only a connection factory change, not a logic rewrite.&lt;/p&gt; 
&lt;h3&gt;RabbitMQ JMS Topic Exchange plugin&lt;/h3&gt; 
&lt;p&gt;Your existing publish/subscribe patterns work without client-side routing logic. The JMS Topic Exchange plugin adds server-side support for JMS topic semantics, handling topic routing and SQL-based message selection directly in the broker.&lt;/p&gt; 
&lt;p&gt;The plugin handles SQL-based message selection (JMS selectors such as &lt;code&gt;OrderType = 'Electronics' AND Priority &amp;gt; 5&lt;/code&gt;) and topic hierarchies with wildcard pattern matching (&lt;code&gt;*&lt;/code&gt; for a single level, &lt;code&gt;#&lt;/code&gt; for multiple levels). Your application uses standard JMS topic APIs (&lt;code&gt;createTopic()&lt;/code&gt; and &lt;code&gt;createConsumer()&lt;/code&gt; with a message selector) without additional filtering logic.&lt;/p&gt; 
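&lt;p&gt;To make the wildcard semantics concrete, the following plain-Java sketch matches dot-separated topic names against a pattern. It is an illustration only, not the plugin’s implementation, and it assumes the usual AMQP topic exchange convention in which &lt;code&gt;*&lt;/code&gt; stands for exactly one word and &lt;code&gt;#&lt;/code&gt; stands for zero or more words. The class and method names are hypothetical.&lt;/p&gt; 

```java
public class TopicPatternSketch {

    // Returns true when the dot-separated topic name matches the pattern.
    // Assumption: "*" stands for exactly one word, "#" for zero or more words.
    static boolean matches(String pattern, String topic) {
        String[] p = pattern.isEmpty() ? new String[0] : pattern.split("\\.");
        String[] t = topic.isEmpty() ? new String[0] : topic.split("\\.");
        return matchFrom(p, 0, t, 0);
    }

    private static boolean matchFrom(String[] p, int pi, String[] t, int ti) {
        if (pi == p.length) {
            // Pattern is used up, so the topic must be used up as well.
            return ti == t.length;
        }
        if (p[pi].equals("#")) {
            // "#" absorbs zero words first, then one more word at a time.
            if (matchFrom(p, pi + 1, t, ti)) {
                return true;
            }
            if (ti == t.length) {
                return false;
            }
            return matchFrom(p, pi, t, ti + 1);
        }
        if (ti == t.length) {
            // Topic is used up but the pattern still expects a word.
            return false;
        }
        if (p[pi].equals("*") || p[pi].equals(t[ti])) {
            return matchFrom(p, pi + 1, t, ti + 1);
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(matches("orders.*", "orders.electronics"));    // one level
        System.out.println(matches("orders.*", "orders.electronics.eu")); // two levels
        System.out.println(matches("orders.#", "orders.electronics.eu")); // any depth
    }
}
```

&lt;p&gt;With these assumptions, &lt;code&gt;orders.*&lt;/code&gt; accepts &lt;code&gt;orders.electronics&lt;/code&gt; but not &lt;code&gt;orders.electronics.eu&lt;/code&gt;, while &lt;code&gt;orders.#&lt;/code&gt; accepts both.&lt;/p&gt; 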
&lt;h2&gt;Getting started&lt;/h2&gt; 
&lt;p&gt;This walkthrough shows you how to set up Amazon MQ and connect your existing JMS application. You will create a broker, configure the connection factory, and send and receive messages.&lt;/p&gt; 
&lt;h3&gt;Prerequisites&lt;/h3&gt; 
&lt;p&gt;You need an existing JMS application built on Apache ActiveMQ or another JMS provider to migrate. If you don’t have one, you can still follow Steps 1–5 to create a broker and test the connection pattern. Before you begin, confirm that you have the following in place:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;An active AWS account&lt;/li&gt; 
 &lt;li&gt;AWS Command Line Interface (AWS CLI) installed. For instructions, see &lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html" target="_blank" rel="noopener noreferrer"&gt;Installing the AWS CLI&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;Java 11 or later installed on your local development environment.&lt;/li&gt; 
 &lt;li&gt;An AWS Identity and Access Management (IAM) principal (user or role) with the &lt;a href="https://docs.aws.amazon.com/aws-managed-policy/latest/reference/AmazonMQFullAccess.html" target="_blank" rel="noopener noreferrer"&gt;AmazonMQFullAccess&lt;/a&gt; managed policy attached.&lt;/li&gt; 
 &lt;li&gt;Maven or Gradle for dependency management.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Amazon MQ broker charges apply based on instance type and usage. Review the &lt;a href="https://aws.amazon.com/amazon-mq/pricing/" target="_blank" rel="noopener noreferrer"&gt;Amazon MQ pricing page&lt;/a&gt; before you start.&lt;/p&gt; 
&lt;h3&gt;Step 1: Create an Amazon MQ for RabbitMQ broker&lt;/h3&gt; 
&lt;p&gt;The following command creates a single-instance broker running RabbitMQ 4.2 on an mq.m7g.medium instance.&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;aws mq create-broker \ 
--broker-name my-rabbitmq-broker \ 
--engine-type rabbitmq \ 
--engine-version 4.2 \ 
--deployment-mode SINGLE_INSTANCE \ 
--host-instance-type mq.m7g.medium \ 
--auto-minor-version-upgrade \ 
--publicly-accessible \ 
--users "Username=admin,Password=[PASSWORD]" \ 
--region us-west-2&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Replace &lt;code&gt;my-rabbitmq-broker&lt;/code&gt; with the name that you want to give to the broker. Replace the &lt;code&gt;admin&lt;/code&gt; user name and the &lt;code&gt;[PASSWORD]&lt;/code&gt; placeholder with your own values, as described in the &lt;a href="https://docs.aws.amazon.com/cli/latest/reference/mq/create-broker.html" target="_blank" rel="noopener noreferrer"&gt;create-broker CLI documentation&lt;/a&gt;. After the command runs successfully, the command line displays the &lt;code&gt;BrokerArn&lt;/code&gt; and &lt;code&gt;BrokerId&lt;/code&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; This command creates a publicly accessible broker for demonstration purposes only. For production workloads, create brokers in private subnets within your VPC, restrict access using security groups, and omit the &lt;code&gt;--publicly-accessible&lt;/code&gt; flag. For more information, see &lt;a href="https://docs.aws.amazon.com/amazon-mq/latest/developer-guide/using-amazon-mq-securely.html" target="_blank" rel="noopener noreferrer"&gt;Security best practices for Amazon MQ&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;The command returns output similar to:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-json"&gt;{
    "BrokerArn": "arn:aws:mq:us-west-2:111122223333:broker:my-rabbitmq-broker:b-c8352341-ec91-4a78-ad9c-a57f23f235bb",
    "BrokerId": "b-c8352341-ec91-4a78-ad9c-a57f23f235bb"
}&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Save the &lt;code&gt;BrokerId&lt;/code&gt; value for the next step.&lt;/p&gt; 
&lt;p&gt;The broker takes approximately 15–20 minutes to reach the &lt;code&gt;RUNNING&lt;/code&gt; state. Run the following command every 2 minutes to check the status:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;aws mq describe-broker --broker-id &amp;lt;BrokerId&amp;gt; --region us-west-2 --query 'BrokerState'&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Proceed to the next step after the broker state is &lt;code&gt;RUNNING&lt;/code&gt;.&lt;/p&gt; 
&lt;p&gt;To get the broker endpoints, run:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;aws mq describe-broker --broker-id &amp;lt;BrokerId&amp;gt; --region us-west-2 --query 'BrokerInstances'&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Note the &lt;code&gt;ConsoleURL&lt;/code&gt; and &lt;code&gt;Endpoints&lt;/code&gt; from the output. The command returns output similar to:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-json"&gt;[{
    "ConsoleURL": "https://b-c8352341-ec91-4a78-ad9c-a57f23f235bb.mq.us-west-2.on.aws",
    "Endpoints": ["amqps://b-c8352341-ec91-4a78-ad9c-a57f23f235bb.mq.us-west-2.on.aws:5671"]
}]&lt;/code&gt;&lt;/pre&gt; 
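&lt;p&gt;The connection factory takes the host name and port as separate values, so it can help to split the endpoint string rather than copy the pieces by hand. The following sketch uses only &lt;code&gt;java.net.URI&lt;/code&gt; from the JDK and the sample endpoint shown above. The class and method names are hypothetical.&lt;/p&gt; 

```java
import java.net.URI;

public class EndpointParts {

    // Host name portion of an endpoint such as amqps://host:5671
    static String hostOf(String endpoint) {
        return URI.create(endpoint).getHost();
    }

    // Port portion of the endpoint; URI returns -1 when no port is present
    static int portOf(String endpoint) {
        return URI.create(endpoint).getPort();
    }

    // Scheme portion of the endpoint, for example "amqps"
    static String schemeOf(String endpoint) {
        return URI.create(endpoint).getScheme();
    }

    public static void main(String[] args) {
        // Sample endpoint from the describe-broker output above
        String endpoint = "amqps://b-c8352341-ec91-4a78-ad9c-a57f23f235bb.mq.us-west-2.on.aws:5671";

        System.out.println("Scheme: " + schemeOf(endpoint));
        System.out.println("Host:   " + hostOf(endpoint));
        System.out.println("Port:   " + portOf(endpoint));
    }
}
```

&lt;p&gt;Replace the sample endpoint with the value from your own &lt;code&gt;Endpoints&lt;/code&gt; output.&lt;/p&gt; 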
&lt;div class="hide-language"&gt; 
 &lt;h3&gt;Step 2: Add the RabbitMQ JMS Client dependency&lt;/h3&gt; 
&lt;/div&gt; 
&lt;p&gt;Choose the dependency that matches your application’s current JMS version. If your imports reference javax.jms packages, use version 2.12.0. If your imports reference jakarta.jms packages (JMS 3.1 / Jakarta EE 9+), use version 3.4.0.&lt;/p&gt; 
&lt;p&gt;For JMS 1.1 and 2.0 (javax.jms):&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-xml"&gt;&amp;lt;dependency&amp;gt;
&amp;lt;groupId&amp;gt;com.rabbitmq.jms&amp;lt;/groupId&amp;gt;
&amp;lt;artifactId&amp;gt;rabbitmq-jms&amp;lt;/artifactId&amp;gt;
&amp;lt;version&amp;gt;2.12.0&amp;lt;/version&amp;gt;
&amp;lt;/dependency&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;div class="hide-language"&gt; 
 &lt;p&gt;For JMS 3.1 / Jakarta Messaging (jakarta.jms):&lt;/p&gt; 
&lt;/div&gt; 
&lt;pre&gt;&lt;code class="lang-xml"&gt;&amp;lt;dependency&amp;gt;
&amp;lt;groupId&amp;gt;com.rabbitmq.jms&amp;lt;/groupId&amp;gt;
&amp;lt;artifactId&amp;gt;rabbitmq-jms&amp;lt;/artifactId&amp;gt;
&amp;lt;version&amp;gt;3.4.0&amp;lt;/version&amp;gt;
&amp;lt;/dependency&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;h3&gt;Step 3: Configure the connection factory&lt;/h3&gt; 
&lt;p&gt;Store your broker credentials in &lt;a href="https://docs.aws.amazon.com/secretsmanager/latest/userguide/retrieving-secrets_cli.html" target="_blank" rel="noopener noreferrer"&gt;AWS Secrets Manager&lt;/a&gt; before configuring the connection factory. This keeps credentials out of your source code and configuration files.&lt;/p&gt; 
&lt;p&gt;Create the secret:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;aws secretsmanager create-secret \
--name dev-rabbitmq \
--description "Amazon MQ broker credentials" \
--secret-string '{"username":"admin","password":"[PASSWORD]"}' \
--region us-west-2&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Add the AWS SDK for Java Secrets Manager module and Jackson (used to parse the secret JSON) to your pom.xml:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-xml"&gt;&amp;lt;dependency&amp;gt;
&amp;lt;groupId&amp;gt;software.amazon.awssdk&amp;lt;/groupId&amp;gt;
&amp;lt;artifactId&amp;gt;secretsmanager&amp;lt;/artifactId&amp;gt;
&amp;lt;version&amp;gt;2.20.0&amp;lt;/version&amp;gt;
&amp;lt;/dependency&amp;gt;

&amp;lt;dependency&amp;gt;
&amp;lt;groupId&amp;gt;com.fasterxml.jackson.core&amp;lt;/groupId&amp;gt;
&amp;lt;artifactId&amp;gt;jackson-databind&amp;lt;/artifactId&amp;gt;
&amp;lt;version&amp;gt;2.15.0&amp;lt;/version&amp;gt;
&amp;lt;/dependency&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Replace your existing broker URL with the Amazon MQ endpoint. In most cases, this is the only change required in your application configuration:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-java"&gt;import com.rabbitmq.jms.admin.RMQConnectionFactory;
import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient;
import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueRequest;
import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueResponse;
import software.amazon.awssdk.regions.Region;
import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import java.util.Map;
import javax.jms.*; 

// Retrieve credentials from AWS Secrets Manager
Map&amp;lt;String, String&amp;gt; creds;
try (SecretsManagerClient secretsClient = SecretsManagerClient.builder().region(Region.US_WEST_2).build()) {
    GetSecretValueResponse response = secretsClient.getSecretValue(
            GetSecretValueRequest.builder().secretId("dev-rabbitmq").build());
    ObjectMapper objectMapper = new ObjectMapper();
    creds = objectMapper.readValue(response.secretString(), new TypeReference&amp;lt;Map&amp;lt;String, String&amp;gt;&amp;gt;() {});
}

// Create and configure the connection factory
RMQConnectionFactory connectionFactory = new RMQConnectionFactory();
connectionFactory.setHost("b-c8352341-ec91-4a78-ad9c-a57f23f235bb.mq.us-west-2.on.aws");
connectionFactory.setPort(5671);
connectionFactory.setVirtualHost("/");
connectionFactory.useSslProtocol();
connectionFactory.setUsername(creds.get("username"));
connectionFactory.setPassword(creds.get("password"));
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Replace the host value with your broker endpoint from Step 1.&lt;/p&gt; 
&lt;p&gt;The connection factory requires four parameters:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Host: Your broker endpoint from the describe-broker output&lt;/li&gt; 
 &lt;li&gt;Port: 5671 for AMQP over TLS (Amazon MQ requires encryption in transit)&lt;/li&gt; 
 &lt;li&gt;VirtualHost: “/” (the default RabbitMQ virtual host)&lt;/li&gt; 
 &lt;li&gt;UseSslProtocol: true (required by Amazon MQ)&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Step 4: Send messages&lt;/h3&gt; 
&lt;p&gt;The following examples show how to send messages to a queue and a topic using the JMS 2.0 simplified API.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Point-to-point (queue) for one-to-one delivery:&lt;/strong&gt;&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-java"&gt;try (JMSContext context = connectionFactory.createContext()) {
	Queue queue = context.createQueue("orders");
	context.createProducer().setProperty("OrderType", "Electronics").send(queue, "Order #12345");
	System.out.println("Sent message to queue: orders");
}&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;Publish/subscribe (topic) for one-to-many broadcast:&lt;/strong&gt;&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-java"&gt;try (JMSContext context = connectionFactory.createContext()) {
	Topic topic = context.createTopic("orders.electronics");
	context.createProducer().setProperty("MessageType", "Broadcast").send(topic, "New electronics order received!");
	System.out.println("Published message to topic: orders.electronics");
}&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;The message properties (&lt;code&gt;OrderType&lt;/code&gt; and &lt;code&gt;MessageType&lt;/code&gt;) are application-defined JMS properties that consumers can use for filtering. These properties become AMQP message headers when transmitted to the broker.&lt;/p&gt; 
&lt;h3&gt;Step 5: Receive messages asynchronously&lt;/h3&gt; 
&lt;p&gt;To receive messages asynchronously, attach a &lt;code&gt;MessageListener&lt;/code&gt; to a consumer.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Queue consumer:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Asynchronous consumers process messages in a background thread without blocking your main application logic. The MessageListener callback fires each time a message arrives, allowing your application to handle messages as they’re delivered rather than polling with &lt;code&gt;receive()&lt;/code&gt;.&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-java"&gt;try (JMSContext context = connectionFactory.createContext()) {
	Queue queue = context.createQueue("orders");
	JMSConsumer consumer = context.createConsumer(queue);
	consumer.setMessageListener(message -&amp;gt; {
		if (message instanceof TextMessage) {
			try {
				System.out.println("Received: " + ((TextMessage) message).getText());
			} catch (JMSException e) {
				e.printStackTrace();
			}
		}
	});

	System.out.println("Listening for messages on queue: orders");

	// Keep the consumer active for 30 seconds
	Thread.sleep(30000);
}&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;Topic subscriber:&lt;/strong&gt;&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-java"&gt;try (JMSContext context = connectionFactory.createContext()) {
	Topic topic = context.createTopic("orders.electronics");
	JMSConsumer consumer = context.createConsumer(topic);
	consumer.setMessageListener(message -&amp;gt; {
		if (message instanceof TextMessage) {
			try {
				System.out.println("Subscriber received: " + ((TextMessage) message).getText());
			} catch (JMSException e) {
				e.printStackTrace();
			}
		}
	});

	System.out.println("Subscribed to topic: orders.electronics");

	// Keep the consumer active for 30 seconds
	Thread.sleep(30000);
}&lt;/code&gt;&lt;/pre&gt; 
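&lt;p&gt;The &lt;code&gt;Thread.sleep(30000)&lt;/code&gt; call keeps the consumer active for 30 seconds, which gives the listener time to receive messages before the try-with-resources block closes the context.&lt;/p&gt; 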
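&lt;p&gt;A fixed sleep is fine for a demo. When you know how many messages to expect, one alternative is to wait on a &lt;code&gt;CountDownLatch&lt;/code&gt; with a timeout so that the program continues as soon as the last message is handled. The following sketch is not part of the original walkthrough. It uses only the JDK, simulates the listener callback with a plain thread instead of a JMS connection, and uses hypothetical class and method names.&lt;/p&gt; 

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class LatchWaitSketch {

    // Waits until the latch reaches zero or the timeout elapses.
    // Returns true only when every expected message was handled in time.
    static boolean awaitMessages(CountDownLatch latch, long timeoutSeconds) {
        try {
            return latch.await(timeoutSeconds, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return false;
        }
    }

    public static void main(String[] args) {
        final String[] bodies = {"Order #1", "Order #2", "Order #3"};
        final CountDownLatch latch = new CountDownLatch(bodies.length);

        // Stand-in for the thread that would invoke the MessageListener callback
        Thread listener = new Thread(new Runnable() {
            @Override
            public void run() {
                for (String body : bodies) {
                    System.out.println("Received: " + body);
                    latch.countDown(); // one expected message handled
                }
            }
        });
        listener.start();

        // Returns as soon as all three messages are handled, or after 30 seconds
        boolean allHandled = awaitMessages(latch, 30);
        System.out.println("All messages handled: " + allHandled);
    }
}
```

&lt;p&gt;In a JMS consumer, the &lt;code&gt;countDown()&lt;/code&gt; call would sit inside the listener callback, and the wait would replace the sleep at the end of the try block.&lt;/p&gt; 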
&lt;h2&gt;Use case: Migrating an ActiveMQ workload to Amazon MQ for RabbitMQ&lt;/h2&gt; 
&lt;p&gt;Migrate your Apache ActiveMQ applications to Amazon MQ by updating four configuration points. Your business logic, message producers, consumers, and listeners stay exactly as written. This walkthrough uses a real JMS 1.1 application with a centralized broker configuration class to show precisely which lines change and which remain identical.&lt;/p&gt; 
&lt;p&gt;Apache ActiveMQ powers messaging infrastructure for thousands of Java applications worldwide. If you run JMS applications on ActiveMQ, you can migrate to Amazon MQ for RabbitMQ with minimal code changes. The following steps demonstrate a complete migration using an application that includes a centralized broker configuration class, a message producer, and a message consumer.&lt;/p&gt; 
&lt;h3&gt;Step 1: Update the Maven dependency&lt;/h3&gt; 
&lt;p&gt;Replace the ActiveMQ client dependencies with the RabbitMQ JMS client in your pom.xml.&lt;/p&gt; 
&lt;p&gt;Before (ActiveMQ):&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-xml"&gt;&amp;lt;dependency&amp;gt;
&amp;lt;groupId&amp;gt;org.apache.activemq&amp;lt;/groupId&amp;gt;
&amp;lt;artifactId&amp;gt;activemq-client&amp;lt;/artifactId&amp;gt;
&amp;lt;version&amp;gt;5.18.6&amp;lt;/version&amp;gt;
&amp;lt;/dependency&amp;gt;

&amp;lt;dependency&amp;gt;
&amp;lt;groupId&amp;gt;org.apache.activemq&amp;lt;/groupId&amp;gt;
&amp;lt;artifactId&amp;gt;activemq-pool&amp;lt;/artifactId&amp;gt;
&amp;lt;version&amp;gt;5.18.6&amp;lt;/version&amp;gt;
&amp;lt;/dependency&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;After (Amazon MQ):&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-xml"&gt;&amp;lt;dependency&amp;gt;
&amp;lt;groupId&amp;gt;com.rabbitmq.jms&amp;lt;/groupId&amp;gt;
&amp;lt;artifactId&amp;gt;rabbitmq-jms&amp;lt;/artifactId&amp;gt;
&amp;lt;version&amp;gt;2.12.0&amp;lt;/version&amp;gt;
&amp;lt;/dependency&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;The &lt;code&gt;rabbitmq-jms&lt;/code&gt; artifact pulls in the RabbitMQ AMQP client and the JMS API as transitive dependencies, so a single entry replaces both ActiveMQ artifacts.&lt;/p&gt; 
&lt;h3&gt;Step 2: Update the broker configuration&lt;/h3&gt; 
&lt;p&gt;If your application centralizes connection details in a shared configuration class, that class is the only file that needs to change. The queue name and everything else your application references remain the same.&lt;/p&gt; 
&lt;p&gt;Before (ActiveMQ):&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-java"&gt;// BrokerConfig.java - ActiveMQ version
public final class BrokerConfig {
	// OpenWire endpoint
	public static final String BROKER_URL = "tcp://localhost:61616";
	public static final String USERNAME = "[PASSWORD]";
	public static final String PASSWORD = "[PASSWORD]";
	public static final String QUEUE_NAME = "demo.queue";
	private BrokerConfig() {}
}&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;After (Amazon MQ):&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-java"&gt;// BrokerConfig.java - Amazon MQ version

import com.fasterxml.jackson.core.type.TypeReference;
import com.fasterxml.jackson.databind.ObjectMapper;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient;
import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueRequest;
import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueResponse;
import java.util.Map;

public final class BrokerConfig {
	// AMQPS endpoint (TLS required by Amazon MQ)
	public static final String BROKER_URL = "amqps://b-c8352341-ec91-4a78-ad9c-a57f23f235bb.mq.us-west-2.on.aws:5671";

	// Queue name carries over unchanged
	public static final String QUEUE_NAME = "demo.queue";

	// Secret name in AWS Secrets Manager
	private static final String SECRET_ID = "dev-rabbitmq";
	private static final Map&amp;lt;String, String&amp;gt; CREDENTIALS = loadCredentials();

	public static String getUsername() { return CREDENTIALS.get("username"); }
	public static String getPassword() { return CREDENTIALS.get("password"); }

	private static Map&amp;lt;String, String&amp;gt; loadCredentials() {
		try (SecretsManagerClient client = SecretsManagerClient.builder().region(Region.US_WEST_2).build()) {
			GetSecretValueResponse response = client.getSecretValue(GetSecretValueRequest.builder().secretId(SECRET_ID).build());
			ObjectMapper mapper = new ObjectMapper();
			return mapper.readValue(response.secretString(), new TypeReference&amp;lt;Map&amp;lt;String, String&amp;gt;&amp;gt;() {});
		} catch (Exception e) {
			throw new RuntimeException("Failed to load broker credentials from Secrets Manager", e);
		}
	}
	private BrokerConfig() {}
}&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Two parts of the broker URL changed compared to the ActiveMQ version: the protocol prefix (&lt;code&gt;tcp://&lt;/code&gt; to &lt;code&gt;amqps://&lt;/code&gt;) and the host and port (OpenWire on 61616 to AMQP over TLS on 5671). The queue name is identical. In addition, credentials are no longer stored as static string constants. Instead, &lt;code&gt;loadCredentials()&lt;/code&gt; retrieves them from &lt;a href="https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/java_secrets-manager_code_examples.html" target="_blank" rel="noopener noreferrer"&gt;AWS Secrets Manager&lt;/a&gt; at startup, and &lt;code&gt;getUsername()&lt;/code&gt; and &lt;code&gt;getPassword()&lt;/code&gt; expose them to the rest of the application. This follows AWS security best practices and streamlines credential rotation.&lt;/p&gt; 
&lt;h3&gt;Step 3: Update the message producer&lt;/h3&gt; 
&lt;p&gt;The producer requires two changes: the import statement and the factory instantiation. Every JMS API call after the factory (&lt;code&gt;createConnection&lt;/code&gt;, &lt;code&gt;createSession&lt;/code&gt;, &lt;code&gt;createProducer&lt;/code&gt;, &lt;code&gt;send&lt;/code&gt;) is identical to the ActiveMQ version.&lt;/p&gt; 
&lt;p&gt;Before (ActiveMQ):&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-java"&gt;import org.apache.activemq.ActiveMQConnectionFactory;
import javax.jms.*;

public class MessageProducer {
	public static void main(String[] args) {
		Connection connection = null;
		try {
			ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(BrokerConfig.USERNAME, BrokerConfig.PASSWORD, BrokerConfig.BROKER_URL);
			connection = factory.createConnection();
			connection.start();
			Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
			Destination destination = session.createQueue(BrokerConfig.QUEUE_NAME);
			javax.jms.MessageProducer producer = session.createProducer(destination);
			producer.setDeliveryMode(DeliveryMode.PERSISTENT);
			for (int i = 1; i &amp;lt;= 5; i++) {
				TextMessage message = session.createTextMessage("Hello from ActiveMQ - message #" + i);
				producer.send(message);
				System.out.println("Sent: " + message.getText());
			}
			producer.close();
			session.close();
		} catch (JMSException e) {
			e.printStackTrace();
		} finally {
			if (connection != null) {
				try {
					connection.close();
				} catch (JMSException ignored) {}
			}
		}
	}
}&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;After (Amazon MQ):&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-java"&gt;import com.rabbitmq.jms.admin.RMQConnectionFactory;
import javax.jms.*;

public class MessageProducer {

	public static void main(String[] args) {
		Connection connection = null;
		try {
			RMQConnectionFactory factory = new RMQConnectionFactory();
			factory.setUri(BrokerConfig.BROKER_URL);
			factory.setUsername(BrokerConfig.getUsername());
			factory.setPassword(BrokerConfig.getPassword());

			// Everything below this line is identical to the ActiveMQ version
			connection = factory.createConnection();
			connection.start();
			Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
			Destination destination = session.createQueue(BrokerConfig.QUEUE_NAME);
			javax.jms.MessageProducer producer = session.createProducer(destination);
			producer.setDeliveryMode(DeliveryMode.PERSISTENT);
			for (int i = 1; i &amp;lt;= 5; i++) {
				TextMessage message = session.createTextMessage("Hello from Amazon MQ - message #" + i);
				producer.send(message);
				System.out.println("Sent: " + message.getText());
			}
			producer.close();
			session.close();
		} catch (JMSException e) {
			e.printStackTrace();
		} finally {
			if (connection != null) {
				try {
					connection.close();
				} catch (JMSException ignored) {}
			}
		}
	}
}&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;The import changes from &lt;code&gt;org.apache.activemq.ActiveMQConnectionFactory&lt;/code&gt; to &lt;code&gt;com.rabbitmq.jms.admin.RMQConnectionFactory&lt;/code&gt;. The factory construction switches from a constructor that accepts credentials and URL to a no-arg constructor with explicit setter calls. Credentials are now retrieved from AWS Secrets Manager through &lt;code&gt;BrokerConfig.getUsername()&lt;/code&gt; and &lt;code&gt;BrokerConfig.getPassword()&lt;/code&gt;. That is the complete change set for the producer.&lt;/p&gt; 
&lt;h2&gt;Step 4: Update the message consumer&lt;/h2&gt; 
&lt;p&gt;The consumer follows the same pattern as the producer. Swap the factory class and import, update the credential calls, and keep everything else.&lt;/p&gt; 
&lt;p&gt;Before (ActiveMQ):&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-java"&gt;import org.apache.activemq.ActiveMQConnectionFactory;
import javax.jms.*;
public class MessageConsumer {
	public static void main(String[] args) {
		Connection connection = null;
		try {
			ActiveMQConnectionFactory factory = new ActiveMQConnectionFactory(BrokerConfig.USERNAME, BrokerConfig.PASSWORD, BrokerConfig.BROKER_URL);
			connection = factory.createConnection();
			connection.start();
			Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
			Destination destination = session.createQueue(BrokerConfig.QUEUE_NAME);
			javax.jms.MessageConsumer consumer = session.createConsumer(destination);
			System.out.println("Waiting for messages on queue: " + BrokerConfig.QUEUE_NAME);
			Message message;
			while ((message = consumer.receive(10000)) != null) {
				if (message instanceof TextMessage) {
					TextMessage textMessage = (TextMessage) message;
					System.out.println("Received: " + textMessage.getText());
				}
			}
			consumer.close();
			session.close();
		} catch (JMSException e) {
			e.printStackTrace();
		} finally {
			if (connection != null) {
				try { 
					connection.close(); 
				} catch (JMSException ignored) {}
			}
		}
	}
}&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;After (Amazon MQ):&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-java"&gt;import com.rabbitmq.jms.admin.RMQConnectionFactory;
import javax.jms.*;

public class MessageConsumer {
	public static void main(String[] args) {
		Connection connection = null;
		try {
			RMQConnectionFactory factory = new RMQConnectionFactory();
			factory.setUri(BrokerConfig.BROKER_URL);
			factory.setUsername(BrokerConfig.getUsername());
			factory.setPassword(BrokerConfig.getPassword());

			// Everything below this line is identical to the ActiveMQ version
			connection = factory.createConnection();
			connection.start();
			Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
			Destination destination = session.createQueue(BrokerConfig.QUEUE_NAME);
			javax.jms.MessageConsumer consumer = session.createConsumer(destination);
			System.out.println("Waiting for messages on queue: " + BrokerConfig.QUEUE_NAME);
			Message message;
			while ((message = consumer.receive(10000)) != null) {
				if (message instanceof TextMessage) {
					TextMessage textMessage = (TextMessage) message;
					System.out.println("Received: " + textMessage.getText());
				}
			}
			consumer.close();
			session.close();
		} catch (JMSException e) {
			e.printStackTrace();
		} finally {
			if (connection != null) {
				try { 
					connection.close(); 
				} catch (JMSException ignored) {}
			}
		}
	}
}&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;The import changes from &lt;code&gt;org.apache.activemq.ActiveMQConnectionFactory&lt;/code&gt; to &lt;code&gt;com.rabbitmq.jms.admin.RMQConnectionFactory&lt;/code&gt;. The factory construction switches to a no-arg constructor with explicit setter calls, and &lt;code&gt;BrokerConfig.USERNAME&lt;/code&gt; / &lt;code&gt;BrokerConfig.PASSWORD&lt;/code&gt; are replaced with &lt;code&gt;BrokerConfig.getUsername()&lt;/code&gt; / &lt;code&gt;BrokerConfig.getPassword()&lt;/code&gt;. The session creation, queue lookup, consumer setup, and message processing loop are identical to the ActiveMQ version.&lt;/p&gt; 
&lt;h3&gt;Configuration&lt;/h3&gt; 
&lt;p&gt;The following table summarizes the changes required when migrating from Apache ActiveMQ.&lt;/p&gt; 
&lt;table border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;ActiveMQ&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Amazon MQ for RabbitMQ&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Maven dependency&lt;/td&gt; 
   &lt;td&gt;activemq-client 5.18.6&lt;/td&gt; 
   &lt;td&gt;rabbitmq-jms 2.12.0&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Connection factory class&lt;/td&gt; 
   &lt;td&gt;ActiveMQConnectionFactory&lt;/td&gt; 
   &lt;td&gt;RMQConnectionFactory&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Import package&lt;/td&gt; 
   &lt;td&gt;org.apache.activemq&lt;/td&gt; 
   &lt;td&gt;com.rabbitmq.jms.admin&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Broker URL format&lt;/td&gt; 
   &lt;td&gt;tcp://host:61616&lt;/td&gt; 
   &lt;td&gt;amqps://broker-id.mq.region.on.aws:5671&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Protocol&lt;/td&gt; 
   &lt;td&gt;OpenWire&lt;/td&gt; 
   &lt;td&gt;AMQP 0-9-1&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Port&lt;/td&gt; 
   &lt;td&gt;61616 (OpenWire)&lt;/td&gt; 
   &lt;td&gt;5671 (AMQP over TLS)&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;TLS&lt;/td&gt; 
   &lt;td&gt;Optional&lt;/td&gt; 
   &lt;td&gt;Required&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Credentials&lt;/td&gt; 
   &lt;td&gt;Plain text / JNDI&lt;/td&gt; 
   &lt;td&gt;AWS Secrets Manager (recommended)&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Virtual host&lt;/td&gt; 
   &lt;td&gt;N/A&lt;/td&gt; 
   &lt;td&gt;/ (default)&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;JMS version support&lt;/td&gt; 
   &lt;td&gt;JMS 1.1&lt;/td&gt; 
   &lt;td&gt;JMS 1.1, 2.0, 3.1 (Jakarta)&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Queue/Topic names&lt;/td&gt; 
   &lt;td&gt;demo.queue&lt;/td&gt; 
   &lt;td&gt;demo.queue (no change)&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;JMS API calls&lt;/td&gt; 
   &lt;td&gt;Standard JMS 1.1&lt;/td&gt; 
   &lt;td&gt;Standard JMS 1.1 (no change)&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
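&lt;p&gt;Because the URL scheme and port both change, a small preflight check can catch a leftover ActiveMQ URL before the application tries to connect. The following is a minimal sketch and not part of the sample application: the &lt;code&gt;BrokerUrlCheck&lt;/code&gt; class, its method name, and the host names are hypothetical, and the check covers only the &lt;code&gt;amqps&lt;/code&gt; scheme and port 5671 listed in the table.&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-java"&gt;import java.net.URI;

// Hypothetical helper, not part of the sample application.
public class BrokerUrlCheck {

	// Returns true only for an amqps:// URL on port 5671 (AMQP over TLS).
	static boolean isAmqpsEndpoint(String brokerUrl) {
		URI uri = URI.create(brokerUrl);
		if (!"amqps".equals(uri.getScheme())) {
			return false;
		}
		return uri.getPort() == 5671;
	}

	public static void main(String[] args) {
		// Placeholder host names for illustration only.
		System.out.println(isAmqpsEndpoint("amqps://b-example.mq.us-west-2.on.aws:5671"));
		System.out.println(isAmqpsEndpoint("tcp://localhost:61616"));
	}
}&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;With these placeholder URLs, the first call passes and the second is rejected. You could run a check like this at startup against &lt;code&gt;BrokerConfig.BROKER_URL&lt;/code&gt; so that a misconfigured environment fails fast instead of timing out on connect.&lt;/p&gt; 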
&lt;h2&gt;Validating the migration&lt;/h2&gt; 
&lt;p&gt;Run your application against Amazon MQ for RabbitMQ in a staging environment before directing production traffic to the new broker. Verify that messages flow correctly, consumers process as expected, and no data loss occurs during cutover.&lt;/p&gt; 
&lt;p&gt;The RabbitMQ management console provides real-time visibility into broker operations. Access it through the ConsoleURL from your broker details. The console shows queue depths, consumer counts, and message rates. Use it during testing to identify routing or throughput issues before production deployment.&lt;/p&gt; 
&lt;p&gt;The console displays &lt;code&gt;jms.durable.queues&lt;/code&gt; and &lt;code&gt;jms.durable.topic&lt;/code&gt; exchanges. The JMS client creates these automatically when your application creates queues and topics, so no manual exchange configuration is required.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/23/BDB-5769-2.png" alt="RabbitMQ management console showing jms.durable.queues and jms.durable.topic exchanges created automatically by the JMS client."&gt;&lt;/p&gt; 
&lt;h2&gt;Monitoring with Amazon CloudWatch&lt;/h2&gt; 
&lt;p&gt;Amazon MQ publishes broker metrics to &lt;a href="https://docs.aws.amazon.com/amazon-mq/latest/developer-guide/security-logging-monitoring.html" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch&lt;/a&gt; with no additional configuration needed. This gives you persistent monitoring and alerting that works alongside the rest of your AWS observability setup, beyond what the RabbitMQ management console provides in real time.&lt;/p&gt; 
&lt;p&gt;After your JMS messages reach the Amazon MQ for RabbitMQ broker, they’re transported as AMQP messages, which means &lt;a href="https://docs.aws.amazon.com/amazon-mq/latest/developer-guide/best-practices-rabbitmq.html" target="_blank" rel="noopener noreferrer"&gt;standard RabbitMQ operational best practices apply&lt;/a&gt;. Keep queue depth low to avoid memory pressure and consumer lag. Follow message durability and reliability guidelines to prevent message loss during broker restarts. For connection management, review broker setup and connection best practices to avoid connection churn.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/ConsoleAlarms.html" target="_blank" rel="noopener noreferrer"&gt;Set Amazon CloudWatch alarms&lt;/a&gt; on &lt;code&gt;MessageCount&lt;/code&gt; and &lt;code&gt;ConnectionCount&lt;/code&gt; first. A rising queue depth with a stable or dropping consumer count is an early signal of a processing bottleneck. A sudden drop in connections can indicate a client configuration issue that’s more straightforward to catch before it affects production traffic.&lt;/p&gt; 
&lt;h2&gt;Clean up&lt;/h2&gt; 
&lt;p&gt;To avoid ongoing charges after testing, delete the Amazon MQ broker and Secrets Manager secret using the AWS CLI.&lt;/p&gt; 
&lt;p&gt;Delete the broker:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;aws mq delete-broker --broker-id &amp;lt;your-broker-id&amp;gt; --region us-west-2&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Delete the Secrets Manager secret:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;aaws secretsmanager delete-secret \
--secret-id dev-rabbitmq \
--force-delete-without-recovery \
--region us-west-2&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Broker deletion is permanent and can’t be undone. Amazon MQ removes all messages, configurations, and user credentials. Leaving the broker running incurs hourly charges based on the instance type, plus storage costs for message data retained on the broker.&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, we showed how to migrate JMS applications from Apache ActiveMQ to Amazon MQ for RabbitMQ, walking through a complete setup from creating the broker to sending and receiving messages. Migrating the broker is the straightforward part. The more significant question is what you do next. After your JMS application is running on Amazon MQ for RabbitMQ, you have access to native AMQP clients, which means non-Java services can start consuming the same messages without a JMS layer. A Java-centric messaging system becomes a shared event backbone that any service can participate in. The migration is a starting point, not just a lift-and-shift.&lt;/p&gt; 
&lt;h2&gt;Next steps&lt;/h2&gt; 
&lt;p&gt;To get started with your migration, &lt;a href="https://console.aws.amazon.com/amazon-mq" target="_blank" rel="noopener noreferrer"&gt;create your first Amazon MQ for RabbitMQ broker&lt;/a&gt;. For detailed technical guidance, see the &lt;a href="https://docs.aws.amazon.com/amazon-mq/latest/developer-guide/rabbitmq-jms-support.html" target="_blank" rel="noopener noreferrer"&gt;Amazon MQ Developer Guide&lt;/a&gt; and explore the &lt;a href="https://www.rabbitmq.com/client-libraries/jms-client" target="_blank" rel="noopener noreferrer"&gt;RabbitMQ JMS Client documentation&lt;/a&gt;.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt;
  &lt;img loading="lazy" class="size-thumbnail wp-image-91737 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/02/vinodh-100x133.jpg" alt="" width="100" height="133"&gt;
  &lt;a href="https://www.linkedin.com/in/vinodh-aws/" target="_blank" rel="noopener"&gt;Vinodh Kannan&lt;/a&gt; is a Senior Specialist Solutions Architect at Amazon Web Services (AWS). His expertise centers on AWS messaging and streaming services, where he provides architectural best practices consultation to AWS customers.
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt;
  &lt;img loading="lazy" class="alignleft wp-image-91736 size-thumbnail" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/02/akhil-100x133.jpeg" alt="Akhil Melakunta" width="100" height="133"&gt;
  &lt;a href="https://www.linkedin.com/in/akhilmelakunta/" target="_blank" rel="noopener"&gt;Akhil Melakunta&lt;/a&gt; is a Senior Solutions Architect at Amazon Web Services (AWS) with over 12 years of industry experience. His expertise spans AI/ML, serverless, and messaging services, where he guides enterprise customers through large-scale cloud transformations on AWS.
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt;
  &lt;img loading="lazy" class="wp-image-91865 size-full alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/04/vignesh.jpeg" alt="" width="120" height="160"&gt;
  &lt;a href="https://ca.linkedin.com/in/vigneshselvam28" target="_blank" rel="noopener"&gt;Vignesh Selvam&lt;/a&gt; is the Principal Product Manager for Amazon MQ at AWS. He works with customers to solve their messaging needs and with the open-source communities for innovating with message brokers. Prior to joining AWS, he built products for security and analytics.
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>Query Amazon Redshift using natural language with Kiro</title>
		<link>https://aws.amazon.com/blogs/big-data/query-amazon-redshift-using-natural-language-with-kiro/</link>
		
		<dc:creator><![CDATA[Hitesh Dodiya]]></dc:creator>
		<pubDate>Thu, 04 Jun 2026 22:06:00 +0000</pubDate>
				<category><![CDATA[Amazon Redshift]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<category><![CDATA[Kiro]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">c3ff9cf32165df5b70a5f62e1c8cb10306c9b06d</guid>

					<description>In this post, you learn how to set up Kiro with the Amazon Redshift MCP server to query your data warehouse using natural language. You explore cluster discovery, schema browsing, analytical queries, cross-cluster comparisons, and data quality checks, all without writing SQL from scratch or switching between tools.</description>
										<content:encoded>&lt;p&gt;It’s Monday morning and your VP pings you: “Revenue dropped 15 percent over the weekend. What happened?” The clock starts. You open the AWS Management Console, find the right Amazon Redshift cluster, open the query editor, and start hunting. Which database has the revenue data, &lt;code&gt;analytics_db&lt;/code&gt; or &lt;code&gt;reporting_db&lt;/code&gt;? Is the table called &lt;code&gt;orders&lt;/code&gt;, &lt;code&gt;transactions&lt;/code&gt;, or &lt;code&gt;sales_events&lt;/code&gt;? You find it, but now you need the schema. Is the amount column &lt;code&gt;total_amount&lt;/code&gt;, &lt;code&gt;revenue&lt;/code&gt;, or &lt;code&gt;order_value&lt;/code&gt;? Twenty minutes in, you still haven’t answered the question. You’ve only been navigating infrastructure.&lt;/p&gt; 
&lt;p&gt;This scenario plays out daily across data teams, and it’s why the landscape is shifting. With AI agents entering the analytics workflow, a growing number of business users can now perform complex data analysis. They no longer need to file a ticket with the data engineering team and wait days for a response. The bottleneck is no longer SQL expertise. It’s the friction between having a question and getting an answer.&lt;/p&gt; 
&lt;p&gt;The Amazon Redshift MCP server paired with Kiro removes that friction. Instead of memorizing cluster endpoints, reverse-engineering schemas, and hand-writing SQL, you describe what you need in plain text and get results. That Monday morning question becomes a single sentence: “Show me daily revenue for the past two weeks, broken down by region.” Kiro finds the cluster, discovers the schema, writes the query, and returns the answer in seconds, not minutes.&lt;/p&gt; 
&lt;p&gt;In this post, you learn how to:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;Install and configure Kiro with the Amazon Redshift MCP server.&lt;/li&gt; 
 &lt;li&gt;Discover clusters, databases, and schemas using natural language.&lt;/li&gt; 
 &lt;li&gt;Run analytical queries and cross-cluster comparisons conversationally.&lt;/li&gt; 
 &lt;li&gt;Implement security best practices for production Amazon Redshift environments.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;You can use Kiro in two forms: &lt;em&gt;Kiro integrated development environment (IDE)&lt;/em&gt;, a full desktop development environment, and &lt;em&gt;Kiro command line interface (CLI)&lt;/em&gt;, which brings the same AI capabilities directly to your terminal. The Redshift MCP server works with both. The CLI experience is particularly well suited for the conversational analytics workflow this post describes, because you can start querying your data warehouse from a terminal session without opening an IDE.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; Before using this integration with production Amazon Redshift environments, read the Security tips section. This section covers critical considerations around AWS Identity and Access Management (IAM) permissions and Kiro autonomy modes.&lt;/p&gt; 
&lt;h2 id="what-is-the-amazon-redshift-mcp-server"&gt;What is the Amazon Redshift MCP server?&lt;/h2&gt; 
&lt;p&gt;The Model Context Protocol (MCP) is an open standard that provides AI agents with secure connections to external data sources and tools. The &lt;a href="https://awslabs.github.io/mcp/servers/redshift-mcp-server" target="_blank" rel="noopener"&gt;Amazon Redshift MCP server&lt;/a&gt; is an open source implementation that bridges the Kiro AI agent with your Amazon Redshift infrastructure.&lt;/p&gt; 
&lt;p&gt;With the Redshift MCP server, you can:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Automatically find both provisioned clusters and serverless workgroups with &lt;em&gt;cluster discovery&lt;/em&gt;.&lt;/li&gt; 
 &lt;li&gt;Browse databases, schemas, tables, and columns with &lt;em&gt;metadata exploration&lt;/em&gt;.&lt;/li&gt; 
 &lt;li&gt;Run SQL in READ ONLY mode with built-in safety protections with &lt;em&gt;safe query execution&lt;/em&gt;.&lt;/li&gt; 
 &lt;li&gt;Work with multiple clusters and workgroups simultaneously with &lt;em&gt;multi-cluster support&lt;/em&gt;.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The server translates your natural language requests into the appropriate Amazon Redshift Data API calls and SQL queries. No manual endpoint configuration or SQL writing is required.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;How the Redshift MCP server relates to the AWS MCP server&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;You might have noticed that AWS also offers the AWS MCP server (part of the Agent Toolkit for AWS), which provides broad access to AWS services, including the Redshift Data API. A common question is: if the AWS MCP server can already reach Redshift, why add a dedicated Redshift MCP server?&lt;/p&gt; 
&lt;p&gt;The two are complementary, not competing. The AWS MCP server gives Kiro general AWS capabilities (service decision guides, SDK usage guidance, troubleshooting skills, and access to AWS APIs). The Redshift MCP server adds a purpose-built analytics layer on top. It provides single-call query execution (compared to a minimum of three API calls for submit, poll, and fetch), read-only safety by default, transparent provisioned and serverless cluster handling, and dedicated metadata navigation tools. Upcoming features like query plan explanation, native identity propagation, cluster analysis, and UDF discovery will further extend this specialized layer.&lt;/p&gt; 
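&lt;p&gt;To make the three-call pattern concrete, the following sketch shows the submit, poll, and fetch loop that a single-call tool hides from the caller. It is a minimal illustration under stated assumptions, not the Redshift Data API client or the MCP server implementation: &lt;code&gt;StatementApi&lt;/code&gt; is a hypothetical interface whose methods stand in for the &lt;code&gt;ExecuteStatement&lt;/code&gt;, &lt;code&gt;DescribeStatement&lt;/code&gt;, and &lt;code&gt;GetStatementResult&lt;/code&gt; actions listed in the IAM policy later in this post, and the polling interval is an arbitrary parameter.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-java"&gt;public class SubmitPollFetch {

    // Hypothetical stand-in for an asynchronous statement API; not a real SDK type.
    interface StatementApi {
        String submit(String sql);       // step 1: returns a statement id
        boolean isFinished(String id);   // step 2: polled until the statement completes
        String[] fetch(String id);       // step 3: returns the result rows
    }

    // Runs all three steps and returns the rows, hiding the polling loop from the caller.
    static String[] runQuery(StatementApi api, String sql, long pollMillis)
            throws InterruptedException {
        String id = api.submit(sql);
        while (!api.isFinished(id)) {
            Thread.sleep(pollMillis);
        }
        return api.fetch(id);
    }

    public static void main(String[] args) throws InterruptedException {
        // In-memory fake that reports completion on the second poll.
        StatementApi fake = new StatementApi() {
            private int polls = 0;
            public String submit(String sql) { return "stmt-1"; }
            public boolean isFinished(String id) { polls++; return polls == 2; }
            public String[] fetch(String id) { return new String[] {"row-1", "row-2"}; }
        };
        System.out.println(runQuery(fake, "SELECT 1", 10).length);
    }
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The sketch omits failure states and timeouts, which a real polling loop would need to handle. Not having to write and maintain that loop is part of what the single-call query tool saves you.&lt;/p&gt; 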
&lt;p&gt;You can use both together, or use the Amazon Redshift MCP server on its own. There’s no either-or requirement.&lt;/p&gt; 
&lt;h2 id="setting-it-up"&gt;Setting it up&lt;/h2&gt; 
&lt;p&gt;The following sections walk you through the installation and configuration process.&lt;/p&gt; 
&lt;h3 id="prerequisites"&gt;Prerequisites&lt;/h3&gt; 
&lt;p&gt;Before you begin, make sure that you have:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;On your machine:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Kiro IDE or Kiro CLI installed.&lt;/li&gt; 
 &lt;li&gt;Python 3.10 or newer.&lt;/li&gt; 
 &lt;li&gt;The uv package manager from Astral.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;On AWS:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;AWS credentials configured through the AWS Command Line Interface (AWS CLI), environment variables, or IAM roles.&lt;/li&gt; 
 &lt;li&gt;At least one Amazon Redshift provisioned cluster or serverless workgroup.&lt;/li&gt; 
 &lt;li&gt;IAM permissions for Amazon Redshift access (see the following section).&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3 id="step-1-install-the-uv-package-manager"&gt;Step 1: Install the uv package manager&lt;/h3&gt; 
&lt;p&gt;If you don’t have uv installed, run one of the following commands.&lt;/p&gt; 
&lt;p&gt;For macOS or Linux:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;curl -LsSf https://astral.sh/uv/install.sh | sh&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;For Windows PowerShell:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-powershell"&gt;powershell -c "irm https://astral.sh/uv/install.ps1 | iex"&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Then install Python 3.10 or newer if needed:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;uv python install 3.10&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3 id="step-2-configure-iam-permissions"&gt;Step 2: Configure IAM permissions&lt;/h3&gt; 
&lt;p&gt;Your AWS identity needs the following permissions. Attach this policy to your IAM user or role:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-json"&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "redshift:DescribeClusters",
        "redshift:GetClusterCredentialsWithIAM",
        "redshift:GetClusterCredentials",
        "redshift-serverless:ListWorkgroups",
        "redshift-serverless:GetWorkgroup",
        "redshift-serverless:GetCredentials",
        "redshift-data:ExecuteStatement",
        "redshift-data:DescribeStatement",
        "redshift-data:GetStatementResult"
      ],
      "Resource": "*"
    }
  ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;You also need database-level permissions: &lt;code&gt;SELECT&lt;/code&gt; on tables you want to query, &lt;code&gt;USAGE&lt;/code&gt; on schemas you want to explore, and connection access to the target databases.&lt;/p&gt; 
&lt;h3 id="step-3-configure-the-mcp-server-in-kiro"&gt;Step 3: Configure the MCP server in Kiro&lt;/h3&gt; 
&lt;p&gt;Open (or create) your Kiro MCP configuration file and add the Amazon Redshift server.&lt;/p&gt; 
&lt;p&gt;For Kiro IDE:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;User-level configuration (applies globally): &lt;code&gt;~/.kiro/settings/mcp.json&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;Workspace-level configuration (applies to a specific project): &lt;code&gt;.kiro/settings/mcp.json&lt;/code&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;For Kiro CLI:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;User-level configuration: &lt;code&gt;~/.kiro/settings/mcp.json&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;Workspace-level configuration: &lt;code&gt;.kiro/settings/mcp.json&lt;/code&gt; in your project directory.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;The configuration format is the same for both. Add the following:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-json"&gt;{
  "mcpServers": {
    "awslabs.redshift-mcp-server": {
      "command": "uvx",
      "args": ["awslabs.redshift-mcp-server@latest"],
      "env": {
        "AWS_PROFILE": "default",
        "AWS_DEFAULT_REGION": "us-east-1",
        "FASTMCP_LOG_LEVEL": "ERROR"
      },
      "disabled": false,
      "autoApprove": []
    }
  }
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; Replace the values of &lt;code&gt;AWS_PROFILE&lt;/code&gt; and &lt;code&gt;AWS_DEFAULT_REGION&lt;/code&gt; with ones that match your environment. If you also set &lt;code&gt;AWS_REGION&lt;/code&gt;, it takes precedence over &lt;code&gt;AWS_DEFAULT_REGION&lt;/code&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; On Windows, use &lt;code&gt;uv&lt;/code&gt; instead of &lt;code&gt;uvx&lt;/code&gt;, with additional arguments. See the &lt;a href="https://awslabs.github.io/mcp/servers/redshift-mcp-server" target="_blank" rel="noopener"&gt;Amazon Redshift MCP server documentation&lt;/a&gt; for the Windows-specific configuration.&lt;/p&gt; 
&lt;h3 id="step-4-verify-the-connection"&gt;Step 4: Verify the connection&lt;/h3&gt; 
&lt;p&gt;In Kiro IDE, restart Kiro (or reconnect MCP servers from the MCP Server view in the Kiro feature panel). Then open the Kiro chat and enter the following prompt:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;List my available Redshift clusters&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;If everything is configured correctly, Kiro discovers your clusters and lists them with details like cluster identifiers, status, node types, and endpoints.&lt;/p&gt; 
&lt;h2 id="practical-examples"&gt;Practical examples&lt;/h2&gt; 
&lt;p&gt;The following scenarios show how Kiro and the Amazon Redshift MCP server work together in real-world workflows.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Tip:&lt;/strong&gt; If you want Kiro to show the SQL that it executes alongside every result, tell it at the beginning of your session:&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;For every query you run against Amazon Redshift, show me the exact SQL you executed along with the results.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;Kiro follows this instruction for the rest of the conversation. For a permanent setup, add this rule to a steering file (see the Tips section that follows) so you don’t need to repeat it. The following examples reflect this behavior.&lt;/p&gt; 
&lt;h3 id="example-1-discover-and-explore-your-infrastructure"&gt;Example 1: Discover and explore your infrastructure&lt;/h3&gt; 
&lt;p&gt;Without Kiro: Log in to the AWS Management Console, navigate to Amazon Redshift, choose clusters, open the query editor, sign in to each database, and manually browse schemas.&lt;/p&gt; 
&lt;p&gt;With Kiro:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;What Amazon Redshift clusters and serverless workgroups do I have?&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Kiro responds with a structured overview:&lt;/p&gt; 
&lt;p&gt;Found two provisioned clusters and one serverless workgroup:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;analytics-cluster (provisioned, 2x ra3.xlplus nodes, available)&lt;/li&gt; 
 &lt;li&gt;reporting-cluster (provisioned, 4x ra3.4xlarge nodes, available)&lt;/li&gt; 
 &lt;li&gt;etl-workgroup (serverless, available)&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Then drill deeper:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;What databases and schemas exist in analytics-cluster?&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Kiro uses the MCP metadata tools to explore the cluster. No SQL is needed for discovery:&lt;/p&gt; 
&lt;p&gt;Database: &lt;code&gt;analytics_db&lt;/code&gt; (local)&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;code&gt;information_schema&lt;/code&gt; — system schema&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;pg_catalog&lt;/code&gt; — system schema&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;public&lt;/code&gt; — default schema&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;sales&lt;/code&gt; — user-created schema&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Database: &lt;code&gt;dev&lt;/code&gt; (local)&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;code&gt;information_schema&lt;/code&gt; — system schema&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;pg_catalog&lt;/code&gt; — system schema&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;public&lt;/code&gt; — default schema&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The only user-created schema is &lt;code&gt;sales&lt;/code&gt; in the &lt;code&gt;analytics_db&lt;/code&gt; database.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;Show me the tables in the sales schema of the analytics database&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;There are five base tables in &lt;code&gt;analytics_db.sales&lt;/code&gt;:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;&lt;code&gt;customers&lt;/code&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;order_items&lt;/code&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;orders&lt;/code&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;products&lt;/code&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;returns&lt;/code&gt;&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;Describe the columns in the orders table&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;table border="1px" width="100%" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Column&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Type&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Nullable&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;code&gt;order_id&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;INTEGER&lt;/td&gt; 
   &lt;td&gt;NO&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;code&gt;customer_id&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;INTEGER&lt;/td&gt; 
   &lt;td&gt;NO&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;code&gt;order_date&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;TIMESTAMP&lt;/td&gt; 
   &lt;td&gt;NO&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;code&gt;total_amount&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;NUMERIC(12,2)&lt;/td&gt; 
   &lt;td&gt;NO&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;code&gt;status&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;VARCHAR(20)&lt;/td&gt; 
   &lt;td&gt;YES&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;Each request builds on the previous context. Kiro remembers which cluster you’re exploring and navigates the hierarchy for you.&lt;/p&gt; 
&lt;h3 id="example-2-on-demand-data-analysis-in-plain-english"&gt;Example 2: On-demand data analysis in plain English&lt;/h3&gt; 
&lt;p&gt;Without Kiro: Write SQL from scratch, guess at column names, run the query, fix syntax errors, iterate.&lt;/p&gt; 
&lt;p&gt;With Kiro:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;Show me the top 10 customers by total purchase amount from the analytics-cluster,
including how many orders each customer placed&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Kiro does the following:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;Identifies the relevant cluster and database.&lt;/li&gt; 
 &lt;li&gt;Explores the schema to find customer and order tables.&lt;/li&gt; 
 &lt;li&gt;Understands the table relationships.&lt;/li&gt; 
 &lt;li&gt;Generates and runs the appropriate SQL, showing you the query first.&lt;/li&gt; 
 &lt;li&gt;Presents the results in a readable format.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;The response includes the query and results:&lt;/p&gt; 
&lt;p&gt;Query executed:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;SELECT
    c.customer_id,
    c.customer_name,
    COUNT(o.order_id) AS total_orders,
    SUM(o.total_amount) AS total_spent
FROM sales.customers c
JOIN sales.orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name
ORDER BY total_spent DESC
LIMIT 10;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;table border="1px" width="100%" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;customer_id&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;customer_name&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;total_orders&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;total_spent&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;4821&lt;/td&gt; 
   &lt;td&gt;Acme Corp&lt;/td&gt; 
   &lt;td&gt;347&lt;/td&gt; 
   &lt;td&gt;$1,284,500.00&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;1190&lt;/td&gt; 
   &lt;td&gt;Globex Inc&lt;/td&gt; 
   &lt;td&gt;289&lt;/td&gt; 
   &lt;td&gt;$1,102,340.00&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;7734&lt;/td&gt; 
   &lt;td&gt;Initech LLC&lt;/td&gt; 
   &lt;td&gt;256&lt;/td&gt; 
   &lt;td&gt;$987,210.00&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;…&lt;/td&gt; 
   &lt;td&gt;…&lt;/td&gt; 
   &lt;td&gt;…&lt;/td&gt; 
   &lt;td&gt;…&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;You didn’t need to remember table names, column names, or join conditions. Kiro figured it out by exploring the metadata first. You can see exactly what it ran, copy the SQL, and reuse it.&lt;/p&gt; 
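&lt;p&gt;To check the shape of this query on data you control, the following is a minimal Python sketch that runs the same join and aggregation against an in-memory SQLite database. SQLite stands in for Amazon Redshift here, the &lt;code&gt;sales&lt;/code&gt; schema prefix is dropped, and the sample rows are invented for illustration, so treat it as a way to sanity-check the logic rather than as what Kiro runs.&lt;/p&gt;

```python
import sqlite3

# Toy in-memory stand-in for the sales schema. Table and column names mirror
# the query above; the rows themselves are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer_id INTEGER, total_amount REAL);
    INSERT INTO customers VALUES (1, 'Acme Corp'), (2, 'Globex Inc'), (3, 'Initech LLC');
    INSERT INTO orders VALUES
        (10, 1, 500.0), (11, 1, 250.0),
        (12, 2, 900.0),
        (13, 3, 100.0), (14, 3, 50.0), (15, 3, 25.0);
""")

# Same join, grouping, ordering, and limit as the generated query.
TOP_CUSTOMERS_SQL = """
    SELECT c.customer_id,
           c.customer_name,
           COUNT(o.order_id) AS total_orders,
           SUM(o.total_amount) AS total_spent
    FROM customers c
    JOIN orders o ON c.customer_id = o.customer_id
    GROUP BY c.customer_id, c.customer_name
    ORDER BY total_spent DESC
    LIMIT 10
"""

top_customers = conn.execute(TOP_CUSTOMERS_SQL).fetchall()
for row in top_customers:
    print(row)
```

&lt;p&gt;Because the expected totals are easy to compute by hand for a toy dataset, a check like this can help confirm that the join and grouping behave as intended before you rely on the same pattern against production data.&lt;/p&gt;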
&lt;h3 id="example-3-cross-cluster-data-comparison"&gt;Example 3: Cross-cluster data comparison&lt;/h3&gt; 
&lt;p&gt;Without Kiro: Open two SQL clients, sign in to different clusters, run similar queries, manually compare results in a spreadsheet.&lt;/p&gt; 
&lt;p&gt;With Kiro:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;Compare the total sales revenue and transaction counts between
analytics-cluster and etl-workgroup for the last quarter.
Are there any discrepancies?&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Kiro queries both environments, aligns the results, and provides a comparison, showing every query it runs along the way.&lt;/p&gt; 
&lt;p&gt;Query executed on &lt;code&gt;analytics-cluster&lt;/code&gt;:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;SELECT
    COUNT(*) AS total_records,
    SUM(total_amount) AS total_revenue,
    COUNT(CASE WHEN order_date &amp;gt;= DATE_TRUNC('quarter', CURRENT_DATE - INTERVAL '3 months')
               AND order_date &amp;lt; DATE_TRUNC('quarter', CURRENT_DATE)
        THEN 1 END) AS q4_transactions
FROM sales.orders;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Query executed on &lt;code&gt;etl-workgroup&lt;/code&gt;:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;SELECT
    COUNT(*) AS total_records,
    SUM(total_amount) AS total_revenue,
    COUNT(CASE WHEN order_date &amp;gt;= DATE_TRUNC('quarter', CURRENT_DATE - INTERVAL '3 months')
               AND order_date &amp;lt; DATE_TRUNC('quarter', CURRENT_DATE)
        THEN 1 END) AS q4_transactions
FROM sales.orders;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Sales data comparison: Analytics vs ETL.&lt;/p&gt; 
&lt;table border="1px" width="100%" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Metric&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Analytics Cluster&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;ETL Workgroup&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Status&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Total Records&lt;/td&gt; 
   &lt;td&gt;172,456&lt;/td&gt; 
   &lt;td&gt;172,456&lt;/td&gt; 
   &lt;td&gt;Match&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Total Revenue&lt;/td&gt; 
   &lt;td&gt;$110,765,431&lt;/td&gt; 
   &lt;td&gt;$110,765,431&lt;/td&gt; 
   &lt;td&gt;Match&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Q4 Transactions&lt;/td&gt; 
   &lt;td&gt;38,061&lt;/td&gt; 
   &lt;td&gt;38,061&lt;/td&gt; 
   &lt;td&gt;Match&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;Because both queries are visible, you can verify that Kiro compared the same metrics on both sides, with no hidden assumptions.&lt;/p&gt; 
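&lt;p&gt;The alignment step itself is simple enough to reproduce by hand. The following Python sketch shows one way to turn two result sets into a status table like the preceding one. The &lt;code&gt;compare_metrics&lt;/code&gt; function is a hypothetical helper, not part of Kiro or the MCP server. The matching figures come from the preceding table, and the drifted value is invented to show a mismatch.&lt;/p&gt;

```python
from typing import Dict, List, Tuple


def compare_metrics(left: Dict[str, int], right: Dict[str, int]) -> List[Tuple[str, object, object, str]]:
    """Return one (metric, left_value, right_value, status) row per metric."""
    rows = []
    # Keep the left-hand ordering, then append metrics that only exist on the right.
    metrics = list(left) + [name for name in right if name not in left]
    for name in metrics:
        a, b = left.get(name), right.get(name)
        if a is None or b is None:
            status = "Missing"
        elif a == b:
            status = "Match"
        else:
            status = "Mismatch"
        rows.append((name, a, b, status))
    return rows


# Figures taken from the comparison table above.
analytics = {"total_records": 172456, "total_revenue": 110765431, "q4_transactions": 38061}
etl = dict(analytics)

# Hypothetical drift: six records missing on the ETL side.
etl_drifted = {**analytics, "total_records": 172450}

for row in compare_metrics(analytics, etl_drifted):
    print(row)
```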
&lt;h3 id="example-4-schema-documentation-on-demand"&gt;Example 4: Schema documentation on demand&lt;/h3&gt; 
&lt;p&gt;Without Kiro: Manually document table structures in a wiki or spreadsheet. Documentation frequently falls out of date.&lt;/p&gt; 
&lt;p&gt;With Kiro:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;Generate documentation for the tables in the sales schema of analytics-cluster.
Include column names, data types, and notable relationships.&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Kiro explores every table using the discovery tools and produces structured documentation.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Table:&lt;/strong&gt; &lt;code&gt;sales.customers&lt;/code&gt;&lt;/p&gt; 
&lt;table border="1px" width="100%" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Column&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Type&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Nullable&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;code&gt;customer_id&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;INTEGER&lt;/td&gt; 
   &lt;td&gt;NO&lt;/td&gt; 
   &lt;td&gt;Primary key&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;code&gt;customer_name&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;VARCHAR(255)&lt;/td&gt; 
   &lt;td&gt;NO&lt;/td&gt; 
   &lt;td&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;code&gt;email&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;VARCHAR(255)&lt;/td&gt; 
   &lt;td&gt;YES&lt;/td&gt; 
   &lt;td&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;code&gt;created_at&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;TIMESTAMP&lt;/td&gt; 
   &lt;td&gt;NO&lt;/td&gt; 
   &lt;td&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;&lt;strong&gt;Table:&lt;/strong&gt; &lt;code&gt;sales.orders&lt;/code&gt;&lt;/p&gt; 
&lt;table border="1px" width="100%" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Column&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Type&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Nullable&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;code&gt;order_id&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;INTEGER&lt;/td&gt; 
   &lt;td&gt;NO&lt;/td&gt; 
   &lt;td&gt;Primary key&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;code&gt;customer_id&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;INTEGER&lt;/td&gt; 
   &lt;td&gt;NO&lt;/td&gt; 
   &lt;td&gt;FK to customers&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;code&gt;order_date&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;TIMESTAMP&lt;/td&gt; 
   &lt;td&gt;NO&lt;/td&gt; 
   &lt;td&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;code&gt;total_amount&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;NUMERIC(12,2)&lt;/td&gt; 
   &lt;td&gt;NO&lt;/td&gt; 
   &lt;td&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;code&gt;status&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;VARCHAR(20)&lt;/td&gt; 
   &lt;td&gt;YES&lt;/td&gt; 
   &lt;td&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;(continues for all tables in the schema…)&lt;/p&gt; 
&lt;p&gt;You can paste this directly into your team wiki or README, and anyone can rerun the same discovery flow to refresh the docs later.&lt;/p&gt; 
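&lt;p&gt;The discovery-to-documentation flow can be approximated outside Kiro as well. The following Python sketch builds the same Column, Type, Nullable, and Description rows from database metadata. It assumes an in-memory SQLite database as a stand-in for Amazon Redshift and reads SQLite's &lt;code&gt;PRAGMA table_info&lt;/code&gt; output. The &lt;code&gt;document_table&lt;/code&gt; function is a hypothetical helper, and the table definition is reconstructed from the &lt;code&gt;sales.customers&lt;/code&gt; documentation above.&lt;/p&gt;

```python
import sqlite3


def document_table(conn, table):
    """Build Column/Type/Nullable/Description rows from SQLite table metadata.

    The table name is interpolated into the statement, so pass trusted names only.
    """
    rows = []
    # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column.
    for _cid, name, col_type, notnull, _default, pk in conn.execute(f"PRAGMA table_info({table})"):
        rows.append({
            "column": name,
            "type": col_type,
            # Treat primary key columns as non-nullable even without NOT NULL.
            "nullable": "NO" if (notnull or pk) else "YES",
            "description": "Primary key" if pk else "",
        })
    return rows


conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,
        customer_name VARCHAR(255) NOT NULL,
        email VARCHAR(255),
        created_at TIMESTAMP NOT NULL
    )
""")

customer_docs = document_table(conn, "customers")
for doc in customer_docs:
    print(f"{doc['column']} | {doc['type']} | {doc['nullable']} | {doc['description']}")
```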
&lt;h3 id="example-5-troubleshooting-data-quality"&gt;Example 5: Troubleshooting data quality&lt;/h3&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;Check the orders table in analytics-cluster for any NULL values
in the customer_id or total_amount columns.
How many records are affected?&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Query executed:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;SELECT
    COUNT(CASE WHEN customer_id IS NULL THEN 1 END) AS null_customer_ids,
    COUNT(CASE WHEN total_amount IS NULL THEN 1 END) AS null_total_amounts,
    COUNT(*) AS total_rows
FROM sales.orders;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Out of 172,456 rows: 0 NULL &lt;code&gt;customer_ids&lt;/code&gt;, 3 NULL &lt;code&gt;total_amounts&lt;/code&gt;.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;Find duplicate order_ids in the sales.orders table and show me
a few examples with their timestamps&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Query executed:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;SELECT order_id, COUNT(*) AS occurrences, MIN(order_date) AS first_seen, MAX(order_date) AS last_seen
FROM sales.orders
GROUP BY order_id
HAVING COUNT(*) &amp;gt; 1
ORDER BY occurrences DESC
LIMIT 5;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Found two duplicate &lt;code&gt;order_ids&lt;/code&gt;: order 88412 (two occurrences), order 91003 (two occurrences).&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;What's the date range of data in the events table?
Are there any gaps in the daily record counts?&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Query executed:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-sql"&gt;WITH daily_counts AS (
    SELECT DATE(event_date) AS day, COUNT(*) AS record_count
    FROM sales.events
    GROUP BY DATE(event_date)
),
date_range AS (
    SELECT MIN(day) AS start_date, MAX(day) AS end_date FROM daily_counts
),
all_dates AS (
    SELECT (start_date + (n || ' days')::INTERVAL)::DATE AS day
    FROM date_range, generate_series(0, (end_date - start_date)) AS n
)
SELECT a.day AS missing_date
FROM all_dates a
LEFT JOIN daily_counts d ON a.day = d.day
WHERE d.day IS NULL
ORDER BY a.day;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Date range: 2024-01-01 to 2025-04-20. Found three missing dates: 2024-03-15, 2024-07-04, 2024-12-25 (likely holidays).&lt;/p&gt; 
&lt;p&gt;Every query is right there in the response. You can copy them into your own SQL client, modify them, or save them as reusable scripts.&lt;/p&gt; 
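&lt;p&gt;If you want to double-check a gap report independently, the same logic is a few lines of Python. The following sketch takes the distinct dates that have records and returns the calendar dates missing between the earliest and latest one. The &lt;code&gt;missing_dates&lt;/code&gt; function and the sample dates are hypothetical and chosen only to illustrate the check.&lt;/p&gt;

```python
from datetime import date, timedelta


def missing_dates(days):
    """Return calendar dates between min(days) and max(days) that are absent from days."""
    present = set(days)
    if not present:
        return []
    start, end = min(present), max(present)
    # Walk every calendar day in the closed range [start, end].
    candidates = (start + timedelta(days=n) for n in range((end - start).days + 1))
    return [day for day in candidates if day not in present]


# Hypothetical sample: records exist on March 14, 16, and 17 but not on March 15.
observed = [date(2024, 3, 14), date(2024, 3, 16), date(2024, 3, 17)]
gaps = missing_dates(observed)
print(gaps)
```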
&lt;h2 id="tips-for-getting-the-most-out-of-kiro-and-redshift"&gt;Tips for getting the most out of Kiro and Redshift&lt;/h2&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;&lt;strong&gt;Start with discovery.&lt;/strong&gt; Begin each session by asking Kiro to list your clusters and explore the database structure. This gives the agent context for subsequent queries.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Be specific about which cluster.&lt;/strong&gt; If you have multiple clusters, mention the cluster name in your request to avoid ambiguity.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Iterate gradually.&lt;/strong&gt; Start with simple questions and build complexity. Ask for a count before asking for a full breakdown.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Use steering files for team conventions.&lt;/strong&gt; Create a &lt;code&gt;.kiro/steering/redshift.md&lt;/code&gt; file in your project with details about your cluster naming conventions, important schemas, and common query patterns. This gives Kiro persistent context about your environment.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Example steering file:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-markdown"&gt;---
inclusion: auto
---

# Redshift Environment Context

## Clusters

- **analytics-cluster**: Primary analytics warehouse. Use database `analytics_db`.
- **etl-workgroup**: Serverless workgroup for ETL pipelines. Use database `staging_db`.

## Key Schemas

- `sales`: Customer transactions, orders, and revenue data
- `marketing`: Campaign performance and attribution data

## Conventions

- Always filter by `is_deleted = false` on soft-delete tables
- Date columns use `TIMESTAMP WITHOUT TIME ZONE` in UTC

## Query Transparency

- Always show the exact SQL query text being executed before or alongside the results.
  Users should be able to see, verify, and reuse every query that Kiro runs against Redshift.&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;That last Query Transparency rule is a small addition with a big impact. By default, Kiro might summarize results without showing the underlying SQL. Adding this steering instruction makes every query visible, which helps maintain consistent behavior across your data team and supports auditing, learning, and trust.&lt;/p&gt; 
&lt;ol start="5" type="1"&gt; 
 &lt;li&gt;&lt;strong&gt;Use hooks for automation.&lt;/strong&gt; Set up agent hooks to run common validation queries automatically. For example, trigger a data quality check whenever you edit a specific SQL file.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Verify important results.&lt;/strong&gt; Always cross-check critical business findings with stakeholders before acting on them.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2 id="security-tips"&gt;Security tips&lt;/h2&gt; 
&lt;p&gt;When configured with least privilege IAM policies and Supervised mode, the Redshift MCP server provides multiple layers of protection. Under the AWS shared responsibility model, you are responsible for configuring access controls appropriately.&lt;/p&gt; 
&lt;h3 id="safety-layers-at-a-glance"&gt;Safety layers at a glance&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Data stays in your account.&lt;/strong&gt; The MCP server runs locally, queries run inside your Amazon Redshift cluster, and no data is sent to third-party services.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Content not used for training.&lt;/strong&gt; Enterprise users are excluded from service improvement usage. Free or individual tier users can opt out in Kiro settings.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Read-only by default.&lt;/strong&gt; Every query is wrapped in a read-only transaction, preventing accidental writes.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Standard IAM controls.&lt;/strong&gt; Kiro only gets permissions you’ve explicitly granted.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Supervised mode.&lt;/strong&gt; Review each agent action before it’s applied in production environments.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Important caveat:&lt;/strong&gt; The read-only enforcement only applies to queries routed through the MCP server’s &lt;code&gt;execute_query&lt;/code&gt; tool. Kiro also has shell access. If your IAM credentials have write permissions, direct CLI calls (&lt;code&gt;aws redshift-data execute-statement&lt;/code&gt;) bypass this guardrail entirely.&lt;/p&gt; 
&lt;p&gt;Takeaway: Use least privilege IAM policies (scoped to read and describe operations only) as your primary defense. Avoid broad policies like &lt;code&gt;AdministratorAccess&lt;/code&gt; or &lt;code&gt;AmazonRedshiftFullAccess&lt;/code&gt;. When IAM is properly scoped, even a direct CLI call cannot perform writes.&lt;/p&gt; 
&lt;h3 id="verify-kiros-output"&gt;Verify Kiro’s output&lt;/h3&gt; 
&lt;p&gt;Kiro is a powerful assistant, but it’s not infallible. Like any AI tool, it can misinterpret your intent, generate incorrect SQL, or present results that look plausible but are wrong. Examples include a misplaced join, a wrong filter, or an aggregation that silently excludes rows. Verifying the output is especially important when you work with production data, where decisions have real business impact.&lt;/p&gt; 
&lt;p&gt;Treat Kiro’s output as a strong starting point, not a final answer. Review the SQL it generates before acting on the results. Use the Query Transparency steering rule (described in the Tips section) so you can see the exact query behind every result. When findings inform business decisions, validate them independently by running the query yourself, cross-checking with a colleague, or comparing against a known baseline.&lt;/p&gt; 
&lt;h3 id="summary"&gt;Summary&lt;/h3&gt; 
&lt;table border="1px" width="100%" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Layer&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;What it protects against&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;MCP server read-only mode&lt;/td&gt; 
   &lt;td&gt;Accidental writes through the MCP &lt;code&gt;execute_query&lt;/code&gt; tool&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Least privilege IAM policy&lt;/td&gt; 
   &lt;td&gt;Write operations via any path, including direct CLI calls&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Kiro Supervised mode&lt;/td&gt; 
   &lt;td&gt;Unreviewed autonomous actions by the agent&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;Defense in depth: use these layers together for production environments.&lt;/p&gt; 
&lt;h2 id="what-you-can-achieve-with-kiro-and-amazon-redshift"&gt;What you can achieve with Kiro and Amazon Redshift&lt;/h2&gt; 
&lt;table border="1px" width="100%" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Before&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Now you can&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Switch between the console, SQL clients, and documentation&lt;/td&gt; 
   &lt;td&gt;Use one interface for discovery, querying, and analysis&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Memorize cluster endpoints, database names, and schemas&lt;/td&gt; 
   &lt;td&gt;Ask in plain text and let Kiro discover the structure&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Write SQL from scratch for every query&lt;/td&gt; 
   &lt;td&gt;Describe what you want and get results&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Manually compare data across clusters&lt;/td&gt; 
   &lt;td&gt;Run single-sentence cross-cluster analysis&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Schema documentation is frequently stale&lt;/td&gt; 
   &lt;td&gt;Generate fresh docs on demand&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Onboarding new analysts takes days&lt;/td&gt; 
   &lt;td&gt;New team members can explore immediately&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;Every minute you spend hunting for a table name or debugging a SQL syntax error is a minute that you’re not spending on actual analysis. You can reduce that overhead by letting Kiro handle the mechanical parts (discovery, navigation, and query construction) so you can focus on the questions that matter to your business.&lt;/p&gt; 
&lt;h2 id="cleaning-up"&gt;Cleaning up&lt;/h2&gt; 
&lt;p&gt;If you created Amazon Redshift resources specifically for this walkthrough, or if you no longer need the MCP server integration, follow these steps. They help you avoid ongoing charges and remove the configuration.&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Remove the MCP server configuration.&lt;/li&gt; 
 &lt;li&gt;Detach the IAM policy.&lt;/li&gt; 
 &lt;li&gt;Delete test Amazon Redshift resources (if applicable).&lt;/li&gt; 
 &lt;li&gt;Uninstall uv (optional).&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, you learned how to set up Kiro with the Amazon Redshift MCP server to query your data warehouse using natural language. You explored cluster discovery, schema browsing, analytical queries, cross-cluster comparisons, and data quality checks, all without writing SQL from scratch or switching between tools.&lt;/p&gt; 
&lt;p&gt;To go further:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;New to Amazon Redshift? &lt;a href="https://aws.amazon.com/redshift/" target="_blank" rel="noopener"&gt;Get started with Amazon Redshift&lt;/a&gt; to create your first cluster or serverless workgroup.&lt;/li&gt; 
 &lt;li&gt;Browse the &lt;a href="https://awslabs.github.io/mcp/" target="_blank" rel="noopener"&gt;AWS MCP Servers documentation&lt;/a&gt; to understand how AI agents work with external tools through the Model Context Protocol (MCP).&lt;/li&gt; 
 &lt;li&gt;Visit &lt;a href="https://kiro.dev" target="_blank" rel="noopener"&gt;kiro.dev&lt;/a&gt; for Kiro’s full capabilities, including specs, hooks, and steering files.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;As you get comfortable with the basics, try combining steering files with agent hooks to automate recurring workflows like daily data quality checks or weekly schema documentation refreshes.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the author&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/03/BDB-5971-1.png" alt="Hitesh Dodiya" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Hitesh Dodiya&lt;/h3&gt; 
  &lt;p&gt;Hitesh is a Database Engineer at Amazon Redshift with over a decade of experience in databases, data warehousing, and analytics. He is passionate about using generative AI to deliver innovative solutions that solve real customer problems.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>Build governance dashboards for Amazon SageMaker Catalog with Amazon Quick</title>
		<link>https://aws.amazon.com/blogs/big-data/build-governance-dashboards-for-amazon-sagemaker-catalog-with-amazon-quick/</link>
					
		
		<dc:creator><![CDATA[Steve Phillips]]></dc:creator>
		<pubDate>Thu, 04 Jun 2026 15:41:17 +0000</pubDate>
				<category><![CDATA[Amazon Quick Suite]]></category>
		<category><![CDATA[Amazon SageMaker Data & AI Governance]]></category>
		<category><![CDATA[Amazon SageMaker Unified Studio]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">f38f4e72cbf4d1ce688d0f71f8092d9d93292b7e</guid>

					<description>In a previous post, we showed you how to query Amazon SageMaker Catalog metadata using SQL by using the metadata export feature. This post builds on that foundation by demonstrating how to create governance dashboards with Amazon Quick.</description>
										<content:encoded>&lt;p&gt;Maintaining visibility into your data catalog’s health requires more than ad-hoc queries. Data stewards and compliance teams need automated dashboards that surface governance metrics and alert them when issues arise. These issues include undocumented assets, missing ownership, and stale metadata.&lt;/p&gt; 
&lt;p&gt;In a previous post, we showed you how to query &lt;a href="https://aws.amazon.com/sagemaker/catalog/" target="_blank" rel="noopener noreferrer"&gt;Amazon SageMaker Catalog&lt;/a&gt; metadata using SQL by using the metadata export feature. This post builds on that foundation by demonstrating how to create governance dashboards with &lt;a href="https://aws.amazon.com/quick/" target="_blank" rel="noopener noreferrer"&gt;Amazon Quick&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;Amazon Quick is an agentic AI-powered digital workspace that provides integrated analytics, automation, and research capabilities. With &lt;a href="https://aws.amazon.com/quick/quicksight/" target="_blank" rel="noopener noreferrer"&gt;Amazon Quick Sight&lt;/a&gt;, a component of Amazon Quick, you can create interactive dashboards and visualizations with automatic chart suggestions and machine learning (ML) insights.&lt;/p&gt; 
&lt;p&gt;We walk through how to connect Amazon Quick Sight to your Amazon SageMaker Catalog metadata and build governance dashboards using natural language prompts.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Solution overview&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;This solution extends the metadata export architecture by adding a visualization layer:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Amazon SageMaker Catalog exports asset metadata daily to Amazon Simple Storage Service (Amazon S3) Tables&lt;/li&gt; 
 &lt;li&gt;Amazon Athena queries the metadata using standard SQL&lt;/li&gt; 
 &lt;li&gt;Amazon Quick Sight connects to Athena for interactive dashboards&lt;/li&gt; 
 &lt;li&gt;Amazon Quick uses natural language to build visualizations&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90836" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5863-image-1.jpeg" alt="AWS Cloud architecture diagram showing the data flow for SageMaker Catalog metadata visualization. Amazon SageMaker Catalog exports asset metadata and daily exports to Amazon S3 Tables in a bucket named aws-sagemaker-catalog. Amazon Athena queries the S3 Tables data using SQL queries. Amazon Quick connects to Athena to provide interactive dashboards, natural language queries, and executive summaries. Arrows indicate the left-to-right data flow from SageMaker Catalog to S3 Tables to Athena, with Athena connecting down to Amazon Quick." width="1411" height="636"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 1 – Amazon SageMaker Catalog governance dashboard architecture&lt;/em&gt;&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Before you begin, complete the steps in &lt;a href="https://aws.amazon.com/blogs/big-data/analyzing-your-data-catalog-query-sagemaker-catalog-metadata-with-sql/" target="_blank" rel="noopener noreferrer"&gt;Analyzing your data catalog: Query SageMaker Catalog metadata with SQL&lt;/a&gt;. When you finish them, you should have the following:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Amazon SageMaker Catalog metadata export enabled&lt;/li&gt; 
 &lt;li&gt;Amazon Athena configured with an S3 bucket for query results&lt;/li&gt; 
 &lt;li&gt;AWS Lake Formation permissions configured for AWS Identity and Access Management (IAM)-based access&lt;/li&gt; 
 &lt;li&gt;An &lt;code&gt;asset_metadata.asset&lt;/code&gt; table that you have verified contains data&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Additionally, you need:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Amazon Quick Sight subscription (Standard or Enterprise edition)&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener noreferrer"&gt;AWS Identity and Access Management permissions&lt;/a&gt; to create Amazon Quick Sight datasets and dashboards&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;&lt;strong&gt;Building a governance dashboard with Amazon Quick Sight&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;To visualize catalog health metrics, connect Amazon Quick Sight to your Athena metadata tables.&lt;/p&gt; 
&lt;h3&gt;Configure Amazon Quick Sight permissions&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Grant permissions to the Amazon Quick Sight service role.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;The Amazon Quick Sight service role (default name: &lt;code&gt;aws-quicksight-service-role-v0&lt;/code&gt;) needs permissions to access &lt;a href="https://aws.amazon.com/s3/features/tables/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3 Tables&lt;/a&gt; and the &lt;a href="https://aws.amazon.com/glue/" target="_blank" rel="noopener noreferrer"&gt;AWS Glue&lt;/a&gt; Data Catalog:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-css"&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3tables:GetTableBucket",
        "s3tables:GetTable",
        "s3tables:GetTableMetadataLocation"
      ],
      "Resource": "arn:aws:s3tables:REGION:ACCOUNT_ID:bucket/aws-sagemaker-catalog/*"
    },
    {
      "Effect": "Allow",
      "Action": "glue:GetCatalog",
      "Resource": "arn:aws:glue:REGION:ACCOUNT_ID:catalog"
    }
  ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Add this as an inline policy to the Amazon Quick Sight service role in the IAM console.&lt;/p&gt; 
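&lt;p&gt;If you manage several accounts or Regions, filling in the &lt;code&gt;REGION&lt;/code&gt; and &lt;code&gt;ACCOUNT_ID&lt;/code&gt; placeholders by hand is error prone. The following Python sketch generates the same policy document from parameters. The &lt;code&gt;quick_sight_inline_policy&lt;/code&gt; function is a hypothetical helper, and the Region and account ID passed to it are placeholder values. The actions and resource patterns are copied from the preceding policy.&lt;/p&gt;

```python
import json


def quick_sight_inline_policy(region, account_id, bucket="aws-sagemaker-catalog"):
    """Return the inline policy shown above with the placeholders filled in."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3tables:GetTableBucket",
                    "s3tables:GetTable",
                    "s3tables:GetTableMetadataLocation",
                ],
                "Resource": f"arn:aws:s3tables:{region}:{account_id}:bucket/{bucket}/*",
            },
            {
                "Effect": "Allow",
                "Action": "glue:GetCatalog",
                "Resource": f"arn:aws:glue:{region}:{account_id}:catalog",
            },
        ],
    }


# Placeholder Region and account ID; substitute your own values.
policy = quick_sight_inline_policy("us-east-1", "111122223333")
print(json.dumps(policy, indent=2))
```

&lt;p&gt;You can paste the printed JSON into the IAM console as the inline policy for the service role.&lt;/p&gt;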
&lt;ol start="2"&gt; 
 &lt;li&gt;Grant &lt;a href="https://aws.amazon.com/lake-formation/" target="_blank" rel="noopener noreferrer"&gt;AWS Lake Formation&lt;/a&gt; permissions:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Both the Amazon Quick Sight service role and your Amazon Quick Sight admin user need AWS Lake Formation permissions on the S3 Tables catalog. First, find your Amazon Quick Sight admin user ARN by running this AWS Command Line Interface (AWS CLI) command:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-typescript"&gt;aws quicksight list-users \
  --aws-account-id ACCOUNT_ID \
  --namespace default \
  --region us-east-1&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
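&lt;p&gt;The command returns JSON with a &lt;code&gt;UserList&lt;/code&gt; array, and each entry includes an &lt;code&gt;Arn&lt;/code&gt; and a &lt;code&gt;Role&lt;/code&gt;. The following Python sketch shows one way to pull the admin ARNs out of that output. The &lt;code&gt;admin_user_arns&lt;/code&gt; function is a hypothetical helper, and the sample output (user names and account ID) is invented for illustration. The sketch matches only the &lt;code&gt;ADMIN&lt;/code&gt; role, so adjust the filter if your account uses other admin-level roles.&lt;/p&gt;

```python
import json


def admin_user_arns(list_users_output):
    """Extract the ARNs of ADMIN users from `aws quicksight list-users` JSON output."""
    response = json.loads(list_users_output)
    return [
        user["Arn"]
        for user in response.get("UserList", [])
        if user.get("Role") == "ADMIN"
    ]


# Invented sample output with a placeholder account ID and user names.
sample_output = json.dumps({
    "UserList": [
        {
            "Arn": "arn:aws:quicksight:us-east-1:111122223333:user/default/admin-user",
            "UserName": "admin-user",
            "Role": "ADMIN",
        },
        {
            "Arn": "arn:aws:quicksight:us-east-1:111122223333:user/default/report-reader",
            "UserName": "report-reader",
            "Role": "READER",
        },
    ]
})

admin_arns = admin_user_arns(sample_output)
print(admin_arns)
```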
&lt;p&gt;Amazon Quick Sight users are managed in the Amazon Quick Sight home AWS Region (us-east-1). To grant permissions, use the Lake Formation console.&lt;/p&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;Navigate to &lt;strong&gt;AWS Lake Formation&lt;/strong&gt; in the &lt;a href="https://aws.amazon.com/console/" target="_blank" rel="noopener noreferrer"&gt;AWS Management Console&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;Select &lt;strong&gt;Data permissions&lt;/strong&gt; and &lt;strong&gt;Grant&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;For &lt;strong&gt;Principals&lt;/strong&gt;, choose &lt;strong&gt;SAML users and groups&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Enter your Amazon Quick Sight admin user ARN (from the preceding command).&lt;/li&gt; 
 &lt;li&gt;Under &lt;strong&gt;LF-Tags or catalog resources&lt;/strong&gt;, choose &lt;strong&gt;Named Data Catalog resources&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;For &lt;strong&gt;Catalogs&lt;/strong&gt;, choose the S3 Tables catalog: &lt;strong&gt;ACCOUNT_ID:s3tablescatalog/aws-sagemaker-catalog&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;For &lt;strong&gt;Databases&lt;/strong&gt;, choose &lt;strong&gt;asset_metadata&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Under &lt;strong&gt;Tables&lt;/strong&gt;, choose &lt;strong&gt;asset&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;For &lt;strong&gt;Table permissions&lt;/strong&gt;, choose &lt;strong&gt;Select&lt;/strong&gt; and &lt;strong&gt;Describe&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Select &lt;strong&gt;Grant&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90835" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5863-image-2.jpeg" alt="Screenshot of AWS Lake Formation Grant permissions page showing the complete permission configuration workflow. At the top, the resource selection shows the 'asset_metadata' database and 'asset' table from the s3tablescatalog/aws-sagemaker-catalog catalog. Below that are optional sections for Views and Data filters, both unselected. The main content area displays three permission configuration sections. First, the 'Table permissions' section shows two subsections: 'Table permissions' with checkboxes for Select (checked, highlighted with orange box), Describe (checked, highlighted with orange box), Insert, Alter, Delete, Drop, and Super; and 'Grantable permissions' with the same permission options all unchecked. The Super permission includes explanatory text stating it is the union of all individual permissions and supersedes them. The Grantable permissions section explains that this allows the principal to grant any of the permissions to others and supersedes grantable permissions. At the bottom, the 'Data permissions' section displays two radio button options: 'All data access' (selected) which grants access to all data without restrictions, and 'Column-based access' which grants data access to specific columns only. An orange arrow points from the right side down to the bottom right corner where 'Cancel' and 'Grant' buttons are located, with the Grant button highlighted in orange." width="1428" height="1247"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 2 – Grant access to Amazon SageMaker Catalog resources&lt;/em&gt;&lt;/p&gt; 
&lt;ol start="13"&gt; 
 &lt;li&gt;Repeat steps 3–12 for the Amazon Quick Sight service role, but in step 5 choose &lt;strong&gt;IAM users and roles&lt;/strong&gt; instead, and in step 6 select the service role rather than entering a user ARN.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;When choosing the catalog in the Lake Formation console, you must choose the full S3 Tables catalog identifier (&lt;code&gt;ACCOUNT_ID:s3tablescatalog/aws-sagemaker-catalog&lt;/code&gt;) to see the &lt;code&gt;asset_metadata&lt;/code&gt; database.&lt;/p&gt; 
&lt;h3&gt;Create an Amazon Quick Sight dataset&lt;/h3&gt; 
&lt;p&gt;To access S3 Tables data, create a Quick Sight dataset that uses an Amazon Athena data source with the custom SQL option. An S3 Tables data source is also available but requires additional permissions. For details on using S3 Tables as an Amazon Quick data source, see &lt;a href="https://aws.amazon.com/blogs/machine-learning/from-data-lake-to-ai-ready-analytics-introducing-direct-query-with-s3-tables-in-amazon-quick/" rel="noopener" target="_blank"&gt;Introducing new data source with S3 Tables in Amazon Quick&lt;/a&gt;.&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Open &lt;strong&gt;Amazon Quick Sight&lt;/strong&gt; in the AWS Management Console.&lt;/li&gt; 
 &lt;li&gt;Select &lt;strong&gt;Analyses &lt;/strong&gt;and &lt;strong&gt;Create analysis.&lt;/strong&gt;&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90834" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5863-image-3.jpeg" alt="Amazon QuickSight Analyses page showing the left navigation menu with Analyses selected under the Quick Sight section. The main content area displays a promotional banner for creating insightful and interactive visualizations with sample chart previews. Below the banner, an orange arrow points to the Create analysis button in the upper right. A table lists an existing analysis named New custom SQL analysis owned by Me and last updated a month ago." width="1431" height="893"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 3 – Create Amazon Quick Sight analysis&lt;/em&gt;&lt;/p&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Create dataset &lt;/strong&gt;and &lt;strong&gt;Create data source.&lt;/strong&gt;&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90833" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5863-image-4.jpeg" alt="Amazon QuickSight Create Analysis dialog prompting the user to choose a dataset. A search field for datasets is shown at the top left. An orange arrow points to the Create dataset button in the upper right. A table below lists one available dataset named New custom SQL with a data source of New custom SQL, owned by Me, and last modified on March 5, 2026." width="1428" height="244"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 4 – Create dataset&lt;/em&gt;&lt;/p&gt; 
&lt;ol start="4"&gt; 
 &lt;li&gt;Select &lt;strong&gt;Amazon Athena&lt;/strong&gt; as the data source and select &lt;strong&gt;Next.&lt;/strong&gt;&lt;/li&gt; 
 &lt;li&gt;Enter a &lt;strong&gt;Data source name&lt;/strong&gt; (for example, “SageMaker Catalog Metadata”) and choose &lt;strong&gt;Create data source.&lt;/strong&gt;&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90832" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5863-image-5.jpeg" alt="Amazon QuickSight New Amazon Athena data source configuration dialog. The Data source name field is highlighted with an orange box and contains the value SageMaker Catalog Metadata. The Athena workgroup dropdown is set to primary. A Validate connection button and SSL is enabled label appear at the bottom left. An orange box highlights the Create data source button at the bottom right." width="1188" height="588"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 5 – Create data source&lt;/em&gt;&lt;/p&gt; 
&lt;ol start="6"&gt; 
 &lt;li&gt;Select &lt;strong&gt;Use custom SQL&lt;/strong&gt; and enter a custom SQL query that references the S3 Tables catalog using the full three-part name.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90831" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5863-image-6.jpeg" alt="Amazon QuickSight Choose your table dialog for the SageMaker Catalog Metadata data source. The Catalog dropdown is set to AwsDataCatalog and the Database dropdown shows a Select prompt. An instructional message explains to choose Prepare data to create a SQL query or choose Select table. An orange arrow points down to the Use custom SQL button highlighted with a blue box at the bottom center. The Select button is highlighted with an orange box at the bottom right." width="1186" height="974"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 6 – Use custom SQL&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90830" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5863-image-7.jpeg" alt="Amazon QuickSight Enter custom SQL query dialog. The query name field shows New custom SQL. The SQL editor contains a query reading SELECT FROM s3tablescatalog/aws-sagemaker-catalog with the query text underlined in orange. An orange box highlights the Confirm query button at the bottom right. An Edit/Preview data button appears at the bottom left." width="1186" height="976"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 7 – Enter custom SQL&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;SELECT * FROM "s3tablescatalog/aws-sagemaker-catalog".asset_metadata.asset&lt;/code&gt;&lt;/p&gt; 
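&lt;p&gt;If you generate this query from a script rather than typing it, the three-part name can be assembled as in the following minimal sketch. The helper names &lt;code&gt;athena_table_ref&lt;/code&gt; and &lt;code&gt;select_all&lt;/code&gt; are hypothetical and are not part of any AWS SDK. The sketch assumes that only the catalog part needs double quotes (because it contains &lt;code&gt;/&lt;/code&gt; and &lt;code&gt;-&lt;/code&gt;) and that an embedded double quote is escaped by doubling it.&lt;/p&gt; 

```python
def athena_table_ref(catalog: str, database: str, table: str) -> str:
    """Build a three-part table reference with a double-quoted catalog."""
    # The catalog name contains "/" and "-", so it is wrapped in double quotes.
    # Assumption: an embedded double quote is escaped by doubling it.
    quoted_catalog = '"' + catalog.replace('"', '""') + '"'
    return f"{quoted_catalog}.{database}.{table}"


def select_all(catalog: str, database: str, table: str) -> str:
    """Return a SELECT * statement against the fully qualified table."""
    return f"SELECT * FROM {athena_table_ref(catalog, database, table)}"


# Values taken from the walkthrough above.
query = select_all("s3tablescatalog/aws-sagemaker-catalog", "asset_metadata", "asset")
print(query)
```

&lt;p&gt;The printed string can be pasted into the custom SQL editor as is.&lt;/p&gt; 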
&lt;ol&gt; 
 &lt;li&gt;Select &lt;strong&gt;Confirm query.&lt;/strong&gt;&lt;/li&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Directly query&lt;/strong&gt; your data (SPICE import may fail with S3 Tables catalogs)&lt;br&gt; &lt;img loading="lazy" class="alignnone size-full wp-image-90829" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5863-image-8.jpeg" alt="Amazon QuickSight Finish dataset creation dialog showing the custom SQL dataset named New custom SQL with the SageMaker Catalog Metadata data source. Two radio button options are displayed: Import to SPICE for quicker analytics with 100 GB available shown in green, and Directly query your data which is selected and highlighted with an orange box. An orange box highlights the Visualize button at the bottom right. Edit/Preview data and Augment with SageMaker buttons appear at the bottom left and center." width="1188" height="570"&gt;&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;em&gt;Figure 8 – Directly query your data&lt;/em&gt;&lt;/p&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Visualize&lt;/strong&gt; and &lt;strong&gt;Create&lt;/strong&gt; to start building your dashboard.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h3&gt;Create visualizations with Amazon Quick.&lt;/h3&gt; 
&lt;p&gt;With Amazon Quick, you can build governance dashboards using natural language prompts. This removes the need for manual field configuration. This approach is faster and more intuitive than traditional dashboard building. The Amazon Quick Sight user must have an AdminPro or AuthorPro subscription (the Build feature isn’t available for Reader users). Start building your dashboard with the following steps:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Select &lt;strong&gt;Build&lt;/strong&gt; in the top toolbar to open the natural language builder.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90828" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5863-image-9.jpeg" alt="Amazon QuickSight analysis editor for New custom SQL analysis. The left Data panel shows the dataset fields including accountid, asset_created_time, asset_id, asset_name, asset_updated_time, business_description, catalog, extended_metadata, namespace, region, resource_description, resource_id, resource_name, resource_type_enum, and snapshot_time. The center Visuals panel shows the Build button highlighted with an orange box and a grid of available chart types. The right canvas area displays an empty AutoGraph placeholder with the message Add 1 or more fields to build a visual. An Add Data section with a dashed border prompts to add a dimension or measure." width="1431" height="756"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 9 – Amazon Quick build dashboard&lt;/em&gt;&lt;/p&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;You will see a text box where you can describe the visualization that you want to create.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90827" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5863-image-10.jpeg" alt="Amazon QuickSight analysis editor with the Build a visual panel open on the right side. An orange arrow points to the natural language input field where the user has typed a prompt requesting asset distribution by resource type as a pie chart, with a Build button next to it. Below the input field, a tooltip explains to describe the visual you would like to build with examples including map showing the top 5 cities by sales, MoM profit in 2026, and average revenue by quarter. The left Data panel shows dataset fields and the center Visuals panel displays available chart types." width="1518" height="687"&gt;&lt;/p&gt; 
&lt;p&gt;Create each visualization using natural language. For each of the six recommended visualizations, enter the corresponding natural language prompt, select &lt;strong&gt;Build&lt;/strong&gt;, then choose &lt;strong&gt;ADD TO ANALYSIS.&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90826" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5863-image-11.jpeg" alt="Amazon QuickSight analysis editor with the Build a visual panel open on the right side. The natural language prompt reads Show asset distribution by resource type as a pie chart with a Build button. Below, the system shows the interpretation as Unique number of Asset Id by Resource Type Enum using the New custom SQL dataset. A pie chart preview is displayed showing the distribution with a large segment labeled GlueTable. An orange arrow points to the Add to Analysis button at the bottom of the panel." width="1505" height="572"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 11 – Add to analysis&lt;/em&gt;&lt;/p&gt; 
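&lt;p&gt;In the preceding figure, Amazon Quick interprets a “count of asset_id” prompt as a unique (distinct) count grouped by the chosen field. The following sketch reproduces that aggregation on a few hypothetical rows so you can sanity-check what a visual should show. It uses an in-memory SQLite table purely for illustration. The sample data and the &lt;code&gt;unique_assets_by&lt;/code&gt; helper are assumptions, and this is not how Amazon Quick computes the result internally.&lt;/p&gt; 

```python
import sqlite3

# Hypothetical sample rows: (asset_id, resource_type_enum, namespace).
# The duplicated a3 row stands in for an asset that appears more than once.
rows = [
    ("a1", "GlueTable", "payments"),
    ("a2", "GlueTable", "payments"),
    ("a3", "GlueTable", "regulatory"),
    ("a3", "GlueTable", "regulatory"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE asset (asset_id TEXT, resource_type_enum TEXT, namespace TEXT)"
)
conn.executemany("INSERT INTO asset VALUES (?, ?, ?)", rows)


def unique_assets_by(column: str) -> dict:
    """Count distinct asset_id values grouped by one of the known columns."""
    # Only allow known column names, because the name is interpolated into SQL.
    if column not in ("resource_type_enum", "namespace"):
        raise ValueError(f"unexpected column: {column}")
    sql = f"SELECT {column}, COUNT(DISTINCT asset_id) FROM asset GROUP BY {column}"
    return dict(conn.execute(sql).fetchall())


by_type = unique_assets_by("resource_type_enum")
by_namespace = unique_assets_by("namespace")
print(by_type, by_namespace)
```

&lt;p&gt;Because the count is distinct, the repeated &lt;code&gt;a3&lt;/code&gt; row is counted once. A plain row count would overstate the number of assets.&lt;/p&gt; 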
&lt;p&gt;Visualization 1: Asset inventory by type&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;Show count of asset_id by resource_type_enum as a pie chart&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;After the pie chart is created, choose &lt;strong&gt;ADD TO ANALYSIS&lt;/strong&gt;.&lt;/p&gt; 
&lt;p&gt;Visualization 2: Documentation completeness&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;Show count of asset_id where business_description is not null as a KPI&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;After the KPI is created, choose &lt;strong&gt;ADD TO ANALYSIS&lt;/strong&gt;.&lt;/p&gt; 
&lt;p&gt;Visualization 3: Monthly registration trends&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;Show count of asset_id by asset_created_time month as a line chart&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;After the line chart is created, choose &lt;strong&gt;ADD TO ANALYSIS&lt;/strong&gt;.&lt;/p&gt; 
&lt;p&gt;Visualization 4: Asset count by account&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;Show count of asset_id by account_id as a bar chart&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;After the bar chart is created, choose &lt;strong&gt;ADD TO ANALYSIS&lt;/strong&gt;.&lt;/p&gt; 
&lt;p&gt;Visualization 5: Namespace distribution&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;Show count of asset_id by namespace as a treemap&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;After the treemap is created, choose &lt;strong&gt;ADD TO ANALYSIS&lt;/strong&gt;.&lt;/p&gt; 
&lt;p&gt;Visualization 6: Resource type by namespace&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;Show count of asset_id by resource_type_enum and namespace as a heat map&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;After the heat map is created, choose &lt;strong&gt;ADD TO ANALYSIS&lt;/strong&gt;.&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Arrange and publish your governance dashboard with the following steps:&lt;/li&gt; 
 &lt;li&gt;Delete any empty or unwanted visualizations by choosing the three dots menu and choosing &lt;strong&gt;Delete.&lt;/strong&gt;&lt;/li&gt; 
 &lt;li&gt;Arrange visualizations by dragging them into your preferred layout.&lt;/li&gt; 
 &lt;li&gt;Resize visualizations to emphasize key metrics.&lt;/li&gt; 
 &lt;li&gt;Add titles to each visualization for clarity.&lt;/li&gt; 
 &lt;li&gt;Choose &lt;strong&gt;PUBLISH&lt;/strong&gt; in the top right corner.&lt;/li&gt; 
 &lt;li&gt;Enter a dashboard name: “SageMaker Catalog Governance Dashboard”.&lt;/li&gt; 
 &lt;li&gt;Verify these options are selected: 
  &lt;ol type="a"&gt; 
   &lt;li&gt;&lt;strong&gt;Allow executive summary.&lt;/strong&gt;&lt;/li&gt; 
   &lt;li&gt;&lt;strong&gt;Allow sharing stories.&lt;/strong&gt;&lt;/li&gt; 
   &lt;li&gt;&lt;strong&gt;Allow sharing scenarios.&lt;/strong&gt;&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Publish dashboard.&lt;/strong&gt;&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90825" style="margin: 10px 0px 10px 0px;border: 1px solid #CCCCCC" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5863-image-12.jpeg" alt="Amazon QuickSight SageMaker Catalog Governance Dashboard showing five visualizations. Top left is a pie chart titled Unique number of Asset Id by Resource Type showing all assets as GlueTable type. Top center is a key performance indicator displaying a total of 500 unique assets. Top right is a horizontal bar chart titled Unique number of Asset Id by Account Id showing five AWS account IDs with values of 109, 105, 104, 103, and 79 assets respectively. Middle left is a stacked bar chart titled Unique number of Asset Id by Resource Type Enum and Namespace showing GlueTable assets distributed across namespaces with values ranging from 33 to 52. Middle right is a treemap titled Unique number of Asset Id by Namespace with trading_analytics at 52, compliance_reporting at 51, treasury_ops at 50, market_data at 44, fraud_detection at 42, customer_analytics at 40, credit_scoring at 40, risk_management at 39, portfolio_mgmt at 37, regulatory at 37, loan_origination at 35, and payments at 33. Bottom is a line chart titled Unique number of Asset Id by Asset Created Time month showing asset creation trends from April 2025 through March 2026 with values fluctuating between approximately 30 and 50 assets per month." width="1429" height="1106"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 12 – Amazon SageMaker Catalog governance dashboard&lt;/em&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li style="list-style-type: none"&gt; 
  &lt;ol start="10"&gt; 
   &lt;li&gt;Analyze your dashboard with natural language.&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;After you publish, you can ask questions about your governance data:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li style="list-style-type: none"&gt; 
  &lt;ol start="11"&gt; 
   &lt;li&gt;On the dashboard, choose &lt;strong&gt;Analyze this dashboard&lt;/strong&gt; &lt;strong&gt;in a Scenario &lt;/strong&gt;in the top center.&lt;/li&gt; 
   &lt;li&gt;In the &lt;strong&gt;Data to Insights&lt;/strong&gt; panel, enter natural language questions such as: 
    &lt;ol type="a"&gt; 
     &lt;li&gt;“Which resource types have the lowest documentation rates?”&lt;/li&gt; 
     &lt;li&gt;“How many assets were registered last month compared to this month?”&lt;/li&gt; 
     &lt;li&gt;“What percentage of assets lack ownership information?”&lt;/li&gt; 
    &lt;/ol&gt; &lt;/li&gt; 
   &lt;li&gt;Choose &lt;strong&gt;Submit &lt;/strong&gt;to generate AI-powered insights.&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Amazon Quick analyzes your data and provides insights with supporting visualizations.&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li style="list-style-type: none"&gt; 
  &lt;ol start="14"&gt; 
   &lt;li&gt;Generate executive summaries.&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Create automated governance reports for data stewards and compliance teams:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li style="list-style-type: none"&gt; 
  &lt;ol start="15"&gt; 
   &lt;li&gt;Choose the Amazon Quick logo in the top left to return to the home page.&lt;/li&gt; 
   &lt;li&gt;Select &lt;strong&gt;Dashboards&lt;/strong&gt; from the left panel.&lt;/li&gt; 
   &lt;li&gt;Choose your “SageMaker Catalog Governance Dashboard”.&lt;/li&gt; 
   &lt;li&gt;Choose the &lt;strong&gt;Create &lt;/strong&gt;dropdown menu in the top right.&lt;/li&gt; 
   &lt;li&gt;Select &lt;strong&gt;Executive Summary&lt;/strong&gt;.&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Amazon Quick automatically generates a summary with key governance insights, including total asset counts and growth trends, documentation completeness metrics, ownership coverage statistics, and classification distribution analysis.&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li style="list-style-type: none"&gt; 
  &lt;ol start="20"&gt; 
   &lt;li&gt;Create governance stories.&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Build governance reports that combine multiple dashboards:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li style="list-style-type: none"&gt; 
  &lt;ol start="21"&gt; 
   &lt;li&gt;From the Create dropdown, select &lt;strong&gt;Story.&lt;/strong&gt;&lt;/li&gt; 
   &lt;li&gt;Enter a prompt: “Write a summary of catalog governance metrics and data quality trends”.&lt;/li&gt; 
   &lt;li&gt;Choose &lt;strong&gt;Add &lt;/strong&gt;to select dashboards to include in the report.&lt;/li&gt; 
   &lt;li&gt;Choose&lt;strong&gt; Build&lt;/strong&gt; (this might take a few minutes to complete).&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Amazon Quick will generate a narrative report combining your visualizations with AI-generated insights. Share the reports with leadership or compliance teams.&lt;/p&gt; 
&lt;p&gt;Governance dashboards contain metadata such as ownership and classification details. Restrict access to users who need it. In the Amazon Quick Sight console, open the dashboard, choose &lt;strong&gt;Share&lt;/strong&gt;, and grant access to named users or a dedicated Quick Sight group (for example, data-stewards) instead of selecting &lt;strong&gt;Everyone in this account&lt;/strong&gt;. Review the dashboard’s permissions periodically and remove entries that are no longer needed.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Cleaning up&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;To avoid ongoing charges, clean up the resources created in this walkthrough. Delete Amazon Quick Sight resources including the dashboard, analyses, and dataset.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;In this post, you connected Amazon Quick Sight to your Amazon SageMaker Catalog metadata export and built governance dashboards using Amazon Quick natural language prompts. This approach gives data stewards and compliance teams visibility into catalog health through six visualizations covering asset inventory, documentation completeness, registration trends, account distribution, namespace distribution, and resource type by namespace.&lt;/p&gt; 
&lt;p&gt;Together with the metadata export and SQL query capabilities covered in the &lt;a href="https://aws.amazon.com/blogs/big-data/analyzing-your-data-catalog-query-sagemaker-catalog-metadata-with-sql/" target="_blank" rel="noopener noreferrer"&gt;Analyzing your data catalog: Query SageMaker Catalog metadata with SQL&lt;/a&gt; post, this solution provides a complete, low-overhead governance monitoring pipeline from raw catalog metadata to executive-ready reporting.&lt;/p&gt; 
&lt;p&gt;To learn more about Amazon SageMaker Catalog, see the &lt;a href="https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-catalog.html" target="_blank" rel="noopener noreferrer"&gt;Amazon SageMaker Catalog documentation&lt;/a&gt;. To expand the work done with Amazon Quick, see the &lt;a href="https://docs.aws.amazon.com/quicksight/" target="_blank" rel="noopener noreferrer"&gt;Amazon Quick Sight documentation&lt;/a&gt;.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90922" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/steve.jpg" alt="" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4" style="text-align: left"&gt;Steve Phillips&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/stevephillipsca"&gt;Steve&lt;/a&gt; is a Principal Technical Account Manager and Analytics specialist at AWS in the North America region. Steve currently focuses on data warehouse architectural design, data lakes, data ingestion pipelines, and cloud distributed architectures.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90921" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/ramesh.png" alt="" width="100" height="135"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4" style="text-align: left"&gt;Ramesh Singh&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="http://www.linkedin.com/in/ramesh-harisaran-singh" target="_blank" rel="noopener"&gt;Ramesh&lt;/a&gt; is a Senior Product Manager Technical (External Services) at AWS in Seattle, Washington, currently with the Amazon SageMaker team. He is passionate about building high-performance ML/AI and analytics products that help enterprise customers achieve their critical goals using cutting-edge technology.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90920" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/pradeep.png" alt="" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4" style="text-align: left"&gt;Pradeep Misra&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/pradeep-m-326258a/" target="_blank" rel="noopener"&gt;Pradeep&lt;/a&gt; is a Principal Analytics and Applied AI Solutions Architect at AWS. He is passionate about solving customer challenges using data, analytics, and Applied AI. Outside of work, he likes exploring new places and playing badminton with his family. He also likes doing science experiments, building LEGOs, and watching anime with his daughters.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90919" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/rohith.png" alt="" width="100" height="106"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4" style="text-align: left"&gt;Rohith Kayathi&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/rohith-kayathi/" target="_blank" rel="noopener"&gt;Rohith&lt;/a&gt; is a Senior Software Engineer at Amazon Web Services (AWS) working with the Amazon SageMaker team. He leads business data catalog, generative AI–powered metadata curation, and lineage solutions. He is passionate about building large-scale distributed systems, solving complex problems, and setting the bar for engineering excellence for his team.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Accelerate SQL development with SageMaker Data Agent in Query Editor</title>
		<link>https://aws.amazon.com/blogs/big-data/accelerate-sql-development-with-sagemaker-data-agent-in-query-editor/</link>
		
		<dc:creator><![CDATA[Jason Ramos]]></dc:creator>
		<pubDate>Thu, 04 Jun 2026 15:40:11 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon SageMaker Data & AI Governance]]></category>
		<category><![CDATA[Amazon SageMaker Unified Studio]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">abc0c7aeb8e3f1ba67f432032ae8dcda4e92c839</guid>

					<description>In this post, you learn how to use Data Agent in Query Editor to explore data, build multi-step analyses, recover from errors, and summarize results using a public education dataset.</description>
										<content:encoded>&lt;p&gt;When you develop SQL against &lt;a href="https://docs.aws.amazon.com/redshift/" target="_blank" rel="noopener"&gt;Amazon Redshift&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/athena/" target="_blank" rel="noopener"&gt;Amazon Athena&lt;/a&gt;, you spend time finding the right tables across hundreds of databases, writing complex joins and aggregations, debugging failed queries without context from previous attempts, and re-specifying filters for every new question. Amazon SageMaker Data Agent in Query Editor takes a different approach. You describe what you need in natural language, and the Data Agent generates the SQL. It references your actual tables through &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html" target="_blank" rel="noopener"&gt;AWS Glue Data Catalog&lt;/a&gt;, proposes step-by-step plans for complex questions, retains context across your session, and offers one-click error recovery with Fix with AI. In this post, you learn how to use Data Agent in Query Editor to explore data, build multi-step analyses, recover from errors, and summarize results using a public education dataset.&lt;/p&gt; 
&lt;h2 id="solution-overview"&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;You can go from a natural language question to executable SQL in seconds. Data Agent in Query Editor provides a conversational interface with direct access to your AWS data environment, so you spend less time on query mechanics and more time on analysis. Data Agent in Query Editor focuses specifically on SQL development against Amazon Redshift and Amazon Athena. (For Python, SQL, and PySpark across broader analytical and machine learning (ML) workloads, use Data Agent in notebooks.)&lt;/p&gt; 
&lt;p&gt;Data Agent provides four key capabilities:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Catalog-aware SQL generation.&lt;/strong&gt; You don’t need to browse catalog structures or memorize schema details. Data Agent reads your table metadata directly.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Querybook and session context.&lt;/strong&gt; You build on previous work. Data Agent uses context from your earlier queries and results.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Step-by-step planning.&lt;/strong&gt; You review and approve a structured plan before Data Agent generates SQL.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Fix with AI.&lt;/strong&gt; You recover from failed queries with one click.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Data Agent integrates with AWS Glue Data Catalog and reads your actual table names, column types, descriptions, and relationships, so generated SQL references your real tables. Each follow-up question builds on your current Query Editor session—the SQL cells in your querybook, the active connection, your selected cell, and execution results from previously run cells. For complex requests, Data Agent produces a structured plan that specifies which data to retrieve, how to aggregate it, and what filters to apply. You review and approve each step before Data Agent proceeds. When a query fails, choose Fix with AI to get a corrected query based on the error and the failed cell’s context.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/02/BDB-5924-1.jpg" alt="Query Editor Fix with AI panel showing a corrected SQL query ready for review" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 1: The Query Editor Fix with AI panel, showing a corrected SQL query ready for your review.]&lt;/em&gt;&lt;/p&gt; 
&lt;h2 id="walkthrough-education-data-analysis"&gt;Walkthrough: Education data analysis&lt;/h2&gt; 
&lt;p&gt;In this section, you use Data Agent in Query Editor to analyze California schools data and identify where SAT improvement investment has the most impact. The walkthrough covers four tasks:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Explore available data.&lt;/li&gt; 
 &lt;li&gt;Build a multi-step analysis plan.&lt;/li&gt; 
 &lt;li&gt;Summarize insights from your queries.&lt;/li&gt; 
 &lt;li&gt;Recover from a failed query.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The same workflow applies to your own data, whether you are analyzing sales figures, operational metrics, or financial records.&lt;/p&gt; 
&lt;p&gt;The California schools dataset contains SAT score results, school demographic information, and county-level data for public schools across California. The dataset includes tables that organize SAT scores by subject (reading, writing, math), school details (name, address, county, district), and enrollment figures. After you upload the data into your project database, you can access the tables directly from Query Editor through your Amazon Athena or Amazon Redshift Lakehouse connection.&lt;/p&gt; 
&lt;h3 id="prerequisites"&gt;Prerequisites&lt;/h3&gt; 
&lt;p&gt;To complete this walkthrough, you need intermediate SQL knowledge and basic familiarity with the AWS Management Console. You don’t need prior AWS Glue experience, but familiarity with data catalogs (centralized metadata repositories) helps.&lt;/p&gt; 
&lt;p&gt;You can choose one of two setup paths:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Quick start (5 minutes).&lt;/strong&gt; SageMaker Unified Studio provides a sample database (&lt;code&gt;sagemaker_sample_db&lt;/code&gt;) with pre-loaded data. To explore it, choose &lt;strong&gt;Data&lt;/strong&gt; in the navigation pane, expand &lt;strong&gt;AwsDataCatalog&lt;/strong&gt;, and select &lt;code&gt;sagemaker_sample_db&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Full setup (30–45 minutes).&lt;/strong&gt; Upload the California schools dataset into your project’s Lakehouse database. This dataset is publicly available from the California Department of Education. Download the SAT scores, school information, and county-level data files, then upload them through the SageMaker Unified Studio UI. In your project, go to &lt;strong&gt;Build&lt;/strong&gt;, choose &lt;strong&gt;Query editor&lt;/strong&gt;, right-click your project database in the Data explorer, and choose &lt;strong&gt;Create table&lt;/strong&gt;. Drag and drop each CSV file to create the tables. SageMaker Unified Studio stores the data in the project-managed Amazon Simple Storage Service (Amazon S3) location, registers it in AWS Glue Data Catalog, and applies &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/what-is-lake-formation.html" target="_blank" rel="noopener"&gt;AWS Lake Formation&lt;/a&gt; governance automatically.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Running queries against Amazon Athena or Amazon Redshift might incur costs. For pricing details, refer to &lt;a href="https://aws.amazon.com/athena/pricing/" target="_blank" rel="noopener"&gt;Amazon Athena pricing&lt;/a&gt; and &lt;a href="https://aws.amazon.com/redshift/pricing/" target="_blank" rel="noopener"&gt;Amazon Redshift pricing&lt;/a&gt;. For detailed setup instructions, refer to &lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/adminguide/iam-based-domains.html" target="_blank" rel="noopener"&gt;AWS Identity and Access Management (IAM)-based domains and projects&lt;/a&gt;. Before starting the walkthrough, you must have a SageMaker Unified Studio IAM-based domain with a project using the &lt;strong&gt;SQL analytics&lt;/strong&gt; or &lt;strong&gt;All Capabilities&lt;/strong&gt; project profile. The project automatically provisions an AWS Glue database, the required IAM role, and Athena or Redshift Lakehouse connections.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/02/BDB-5924-2.jpg" alt="Data Explorer panel in Query Editor showing the california_schools_db and sagemaker_sample_db tables" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 2: The Data Explorer panel in Query Editor, showing the &lt;code&gt;california_schools_db&lt;/code&gt; and &lt;code&gt;sagemaker_sample_db&lt;/code&gt; tables.]&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Explore available data.&lt;/strong&gt; To start, enter the following prompt in the Data Agent panel:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;Query my SAT scores from my california_schools_db&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Data Agent searches AWS Glue Data Catalog, locates the relevant tables, and generates an initial exploratory query that retrieves SAT score records. It adds a SQL cell directly to your querybook.&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Review the generated SQL in the comparison view, which highlights the proposed code.&lt;/li&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Accept&lt;/strong&gt;, &lt;strong&gt;Reject&lt;/strong&gt;, or &lt;strong&gt;Accept and run&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;After you run the cell, the results appear inline, giving you a view of the data (column names, score ranges, and the number of records) before you write SQL.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/02/BDB-5924-3.jpg" alt="Data Agent returns an exploratory query for the california_schools_db tables, ready for review" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 3: Data Agent returns an exploratory query for the &lt;code&gt;california_schools_db&lt;/code&gt; tables, ready for your review.]&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/02/BDB-5924-4.jpg" alt="SQL query results appear beneath the cell after choosing Accept and run" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 4: The SQL query results appear beneath the cell after you choose Accept and run.]&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Build a multi-step analysis plan.&lt;/strong&gt; With the data explored, enter a more complex analytical question:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;Identify which subjects need investment to improve SAT scores in the lowest-performing counties. Include school-level details with addresses.&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Data Agent proposes a step-by-step plan before generating SQL. For this request, Data Agent breaks the question into three steps:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;Aggregate SAT scores by county and subject to find performance patterns.&lt;/li&gt; 
 &lt;li&gt;Filter to counties with a sufficient number of schools and rank the lowest performers.&lt;/li&gt; 
 &lt;li&gt;Join school address data to produce a final detailed list.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Review the plan in the Data Agent panel and choose &lt;strong&gt;Run step-by-step&lt;/strong&gt; to proceed.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/02/BDB-5924-5.jpg" alt="Data Agent proposes a multi-step plan with Cancel plan and Run step-by-step options" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 5: Data Agent proposes a multi-step plan with options to Cancel plan or Run step-by-step.]&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;Data Agent generates SQL for each step and adds it as a separate querybook cell. Review each cell’s SQL in the comparison view, then choose &lt;strong&gt;Accept and run&lt;/strong&gt; to execute it. The results from each step are visible inline, so you can verify the intermediate output (county-level aggregations, the filtered ranking, and the final school list) before moving to the next step. When the steps are complete, your querybook contains the full analytical progression from raw scores to a detailed investment list.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/02/BDB-5924-6.jpg" alt="Each plan step produces a separate querybook cell that can be reviewed and run independently" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 6: Each plan step produces a querybook cell that you can review and run independently.]&lt;/em&gt;&lt;/p&gt; 
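&lt;p&gt;To make the three plan steps concrete, the following is a minimal sketch that runs comparable queries against an in-memory SQLite database. The table names, column names, and rows are hypothetical stand-ins, not the actual &lt;code&gt;california_schools_db&lt;/code&gt; schema, and the SQL that Data Agent generates for Athena or Redshift will likely differ in dialect and detail.&lt;/p&gt;

```python
import sqlite3

# Hypothetical schema and rows; the real california_schools_db tables will differ.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sat_scores (school_id INTEGER, county TEXT, subject TEXT, avg_score REAL);
CREATE TABLE schools (school_id INTEGER, name TEXT, street TEXT, city TEXT);
INSERT INTO sat_scores VALUES
  (1, 'Alpha', 'math', 420), (1, 'Alpha', 'reading', 450),
  (2, 'Alpha', 'math', 440), (2, 'Alpha', 'reading', 470),
  (3, 'Beta',  'math', 560), (3, 'Beta',  'reading', 540),
  (4, 'Beta',  'math', 580), (4, 'Beta',  'reading', 550),
  (5, 'Gamma', 'math', 300), (5, 'Gamma', 'reading', 310);
INSERT INTO schools VALUES
  (1, 'Alpha High', '1 Main St', 'Alphaville'),
  (2, 'Alpha Prep', '2 Oak Ave', 'Alphaville'),
  (3, 'Beta High',  '3 Pine Rd', 'Betatown'),
  (4, 'Beta Prep',  '4 Elm St',  'Betatown'),
  (5, 'Gamma High', '5 Fir Ln',  'Gammaburg');
""")

# Step 1: aggregate scores by county and subject.
step1 = conn.execute("""
    SELECT county, subject, AVG(avg_score) AS mean_score,
           COUNT(DISTINCT school_id) AS school_count
    FROM sat_scores
    GROUP BY county, subject
""").fetchall()

# Step 2: keep counties with at least two schools, then take the weakest
# county and subject combination.
step2 = conn.execute("""
    SELECT county, subject, AVG(avg_score) AS mean_score
    FROM sat_scores
    GROUP BY county, subject
    HAVING COUNT(DISTINCT school_id) >= 2
    ORDER BY mean_score ASC
    LIMIT 1
""").fetchall()

# Step 3: join school addresses for that county and subject.
lowest_county, weakest_subject, _ = step2[0]
step3 = conn.execute("""
    SELECT s.name, s.street, s.city, t.avg_score
    FROM sat_scores AS t
    JOIN schools AS s ON s.school_id = t.school_id
    WHERE t.county = ? AND t.subject = ?
    ORDER BY t.avg_score ASC
""", (lowest_county, weakest_subject)).fetchall()

for row in step3:
    print(row)
```

&lt;p&gt;In this sample data, the single-school county is dropped in step 2 even though it has the lowest scores. This is the kind of intermediate result worth checking before you accept the next cell.&lt;/p&gt;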
&lt;p&gt;&lt;strong&gt;Summarize insights from your queries.&lt;/strong&gt; After running the analysis, enter the following prompt:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;Summarize the insights from my queries&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Data Agent has context on your querybook, including the SQL and the query results from each cell. It generates a natural language summary: which counties are underperforming, which subjects (reading, writing, or math) need the most attention in each county, and how many schools appear on the investment list. This summary provides a starting point for a report or presentation.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/02/BDB-5924-7.jpg" alt="Data Agent summarizes insights from the accumulated query results in the querybook" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 7: Data Agent summarizes insights from the accumulated query results in the querybook.]&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Recover from a failed query.&lt;/strong&gt; During the analysis, a generated query might produce an error, for example, referencing a column name that doesn’t match the schema or a join condition that returns unexpected results. When a cell fails, Query Editor displays the error message and a &lt;strong&gt;Fix with AI&lt;/strong&gt; option.&lt;/p&gt; 
&lt;p&gt;Choose &lt;strong&gt;Fix with AI&lt;/strong&gt;, and Data Agent reads the error in the context of the failed cell, then generates corrected SQL and updates the querybook cell. Run the corrected cell to verify the fix.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/02/BDB-5924-8.jpg" alt="After choosing Fix with AI, Data Agent generates a corrected query for the failed cell" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 8: After you choose Fix with AI, Data Agent is prompted to generate a corrected query for the failed cell.]&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/02/BDB-5924-9.jpg" alt="Data Agent returns corrected SQL for review" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 9: Data Agent returns corrected SQL for you to review.]&lt;/em&gt;&lt;/p&gt; 
&lt;h2 id="security-and-governance"&gt;Security and governance&lt;/h2&gt; 
&lt;p&gt;Data Agent operates within your AWS environment and only accesses data that your IAM policies explicitly permit. Your existing IAM access controls and AWS Lake Formation permissions determine what data Data Agent can reach. To use Data Agent, your project role must have permissions to invoke specific &lt;a href="https://docs.aws.amazon.com/datazone/latest/userguide/what-is-datazone.html" target="_blank" rel="noopener"&gt;Amazon DataZone&lt;/a&gt; APIs. For more information, refer to &lt;a href="https://docs.aws.amazon.com/service-authorization/latest/reference/list_amazondatazone.html" target="_blank" rel="noopener"&gt;Actions, resources, and condition keys for Amazon DataZone&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;Data Agent includes content filtering that prevents it from responding to off-topic requests, requests to reveal its system prompt, and requests for internal technical implementation details. Data Agent is restricted to AWS-related topics and English-language output.&lt;/p&gt; 
&lt;p&gt;Amazon SageMaker stores your natural language prompts and generated SQL in the AWS Region where you created your SageMaker Unified Studio domain. Data Agent doesn’t store your data, querybook context, or catalog metadata.&lt;/p&gt; 
&lt;p&gt;To opt out of data usage for service improvement, configure an AI services opt-out policy for Amazon DataZone in AWS Organizations. For more information, refer to &lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/data-agent-data-storage.html" target="_blank" rel="noopener"&gt;Data storage in the SageMaker Data Agent&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/data-agent-service-improvement.html" target="_blank" rel="noopener"&gt;Service improvement&lt;/a&gt;, and &lt;a href="https://docs.aws.amazon.com/organizations/latest/userguide/orgs_manage_policies_ai-opt-out.html" target="_blank" rel="noopener"&gt;AI services opt-out policies&lt;/a&gt;.&lt;/p&gt; 
&lt;h2 id="clean-up"&gt;Clean up&lt;/h2&gt; 
&lt;p&gt;The walkthrough creates querybook cells in your Query Editor session but doesn’t provision standalone infrastructure. To remove the generated SQL cells, delete them from your querybook or delete the querybook itself.&lt;/p&gt; 
&lt;p&gt;If you uploaded the California schools dataset specifically for this walkthrough, remove the following resources to avoid ongoing charges:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;SageMaker Unified Studio domain.&lt;/strong&gt; If you created a domain solely for this walkthrough, delete it to stop incurring charges. Refer to the SageMaker Unified Studio administration guide for deletion steps.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Uploaded tables.&lt;/strong&gt; In the Data explorer, right-click each table you created and choose &lt;strong&gt;Delete table&lt;/strong&gt; to remove the data from your project database and the underlying S3 storage.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon Athena query results.&lt;/strong&gt; Amazon Athena stores query results in an S3 output location. Delete the query result files from that bucket, or delete the bucket if you created it solely for this walkthrough.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon CloudWatch logs.&lt;/strong&gt; If Amazon Athena queries generated CloudWatch log groups, delete those log groups to avoid storage charges.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2 id="conclusion"&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;Data Agent in Query Editor brings conversational, catalog-aware SQL development to your Amazon Redshift and Amazon Athena workloads. In this post, you explored unfamiliar data, built a multi-step investment analysis, recovered from query errors, and summarized findings through natural language prompts.&lt;/p&gt; 
&lt;p&gt;Data Agent works within your existing IAM and AWS Lake Formation security controls, keeps your data within your AWS environment, and retains context across your analytical workflow so each question builds on the last.&lt;/p&gt; 
&lt;p&gt;Get started with these next steps:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;&lt;strong&gt;Run your first prompt.&lt;/strong&gt; Open Query Editor in your SageMaker Unified Studio domain and enter &lt;code&gt;Show me the top 10 tables in my catalog with the most columns&lt;/code&gt;. For setup, refer to the &lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/getting-started.html" target="_blank" rel="noopener"&gt;SageMaker Unified Studio getting started guide&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Add descriptions to your AWS Glue Data Catalog.&lt;/strong&gt; Table descriptions and column-level business metadata improve the quality of generated SQL. For best practices, refer to &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/populate-data-catalog.html" target="_blank" rel="noopener"&gt;Populating the AWS Glue Data Catalog&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Try a multi-step analysis.&lt;/strong&gt; Enter &lt;code&gt;Which product categories had declining revenue quarter-over-quarter, and which regions drove the decline?&lt;/code&gt; and review Data Agent’s plan step by step.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;For more information, refer to the &lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/sagemaker-data-agent.html" target="_blank" rel="noopener"&gt;Amazon SageMaker Data Agent documentation&lt;/a&gt;, the &lt;a href="https://aws.amazon.com/about-aws/whats-new/2026/03/amazon-sagemaker-data-agent-query-editor/" target="_blank" rel="noopener"&gt;What’s New blog post&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/redshift/" target="_blank" rel="noopener"&gt;Amazon Redshift documentation&lt;/a&gt;, and &lt;a href="https://docs.aws.amazon.com/athena/" target="_blank" rel="noopener"&gt;Amazon Athena documentation&lt;/a&gt;. To learn how Data Agent works in notebooks, refer to &lt;a href="https://aws.amazon.com/blogs/big-data/accelerate-context-aware-data-analysis-and-ml-workflows-with-amazon-sagemaker-data-agent/" target="_blank" rel="noopener"&gt;Accelerate context-aware data analysis and ML workflows with Amazon SageMaker Data Agent&lt;/a&gt;.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/02/BDB-5924-10.jpg" alt="Jason Ramos" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Jason Ramos&lt;/h3&gt; 
  &lt;p&gt;Jason is a Front-End Engineer on the Amazon SageMaker Unified Studio team. He builds the scalable frontend experiences that power SageMaker Data Agent, bringing conversational AI capabilities to data scientists, analysts, and engineers across SageMaker Unified Studio. Outside of work, he enjoys playing piano and exploring the Bay Area food scene.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/02/BDB-5924-11.jpg" alt="Olena Mursalova" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Olena Mursalova&lt;/h3&gt; 
  &lt;p&gt;Olena is a Software Development Engineer on the Amazon SageMaker Unified Studio team, where she develops the SageMaker Data Agent — an intelligent assistant that turns natural language prompts into code, visualizations, and data insights for data engineers and analysts.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/02/BDB-5924-12.jpg" alt="Jessica Cheng" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Jessica Cheng&lt;/h3&gt; 
  &lt;p&gt;Jessica is a Front-End Engineer on the Amazon SageMaker Unified Studio team based in the Bay Area, where she builds intelligent data agent experiences. At work, she is passionate about creating accessible, easy-to-use experiences at scale. Outside of work, her passions lie in finding the best swimming hole in California.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/02/BDB-5924-13.jpg" alt="Sanjana Sekar" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Sanjana Sekar&lt;/h3&gt; 
  &lt;p&gt;Sanjana is a Software Development Engineer on the Amazon SageMaker Unified Studio team. She was one of the engineers who built the SageMaker Data Agent, bringing conversational AI-powered SQL generation and debugging to Query Editor. She is focused on improving data agent capabilities and the compute blueprints experience within SageMaker Unified Studio. Outside of work, she enjoys hiking and biking.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/01/26/guptsid-BDB-57074-225x300.jpg" alt="Siddharth Gupta" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Siddharth Gupta&lt;/h3&gt; 
  &lt;p&gt;Siddharth is heading Generative AI within SageMaker’s Unified Experiences. His focus is on driving agentic experiences, where AI systems act autonomously on behalf of users to accomplish complex tasks. An alumnus of the University of Illinois at Urbana-Champaign, he brings extensive experience from his roles at Yahoo, Glassdoor, and Twitch.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>Schedule notebook runs in Amazon SageMaker Unified Studio</title>
		<link>https://aws.amazon.com/blogs/big-data/schedule-notebook-runs-in-amazon-sagemaker-unified-studio/</link>
		
		<dc:creator><![CDATA[Shivani Mehendarge]]></dc:creator>
		<pubDate>Wed, 03 Jun 2026 20:19:11 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon SageMaker Unified Studio]]></category>
		<category><![CDATA[Announcements]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">0cf8bfe6ea74c2fb21eb18c3f96fe1e5a34f317d</guid>

					<description>In this post, we walk you through the new scheduling and orchestrating capabilities for notebooks in Amazon SageMaker Unified Studio.</description>
										<content:encoded>&lt;p&gt;If you build notebooks for recurring tasks such as daily customer analysis, weekly report generation, or data quality checks in &lt;a href="https://aws.amazon.com/sagemaker/unified-studio/" target="_blank" rel="noopener"&gt;Amazon SageMaker Unified Studio&lt;/a&gt;, you’ve likely wanted to run them automatically on a schedule. Until now, there wasn’t a native way to do this. Teams had to manage orchestration separately, even though the interactive notebook experience was already in place. Now, notebook scheduling is available, so you can configure your production workloads to run automatically with minimal manual intervention.&lt;/p&gt; 
&lt;p&gt;In this post, we walk you through the new scheduling and orchestrating capabilities for notebooks in Amazon SageMaker Unified Studio. You will learn how to:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Trigger on-demand background runs, such as a model re-training job, without waiting at your desk.&lt;/li&gt; 
 &lt;li&gt;Create recurring schedules for tasks such as nightly data freshness checks or weekly business reviews.&lt;/li&gt; 
 &lt;li&gt;Parameterize notebooks so a single template can generate reports across different AWS Regions or customer segments.&lt;/li&gt; 
 &lt;li&gt;Orchestrate multi-notebook workflows where one notebook’s output feeds into the next. For example, an extract, transform, and load (ETL) pipeline followed by a summary dashboard refresh.&lt;/li&gt; 
 &lt;li&gt;Debug failed runs with AI-assisted troubleshooting.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Sample use case overview&lt;/h2&gt; 
&lt;p&gt;In this walkthrough, you will take on the role of a logistics analyst who monitors shipping performance across carriers. The notebook loads shipping data from the &lt;code&gt;ShippingLogs.csv&lt;/code&gt; dataset, identifies late deliveries, and generates a performance summary. You want to run this notebook every morning without manual intervention, reuse it across different carriers, and know when something goes wrong.&lt;/p&gt; 
&lt;p&gt;You will start by running a notebook in the background and viewing the results. Next, you will create a recurring schedule for daily runs, then parameterize the notebook to generate reports for different carriers. You will also orchestrate the notebook in a multi-step workflow and debug a failed run using AI-assisted troubleshooting.&lt;/p&gt; 
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;Before you begin, you need:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;An Amazon SageMaker Unified Studio project with Notebooks enabled. See &lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/adminguide/setup-iam-based-domains.html" target="_blank" rel="noopener"&gt;Set up IAM-based domains&lt;/a&gt; for permission requirements.&lt;/li&gt; 
 &lt;li&gt;A sample dataset. We use the &lt;code&gt;ShippingLogs.csv&lt;/code&gt; dataset, which contains shipping data including estimated and actual delivery times, carriers, and origins. You can download it from the &lt;a href="https://catalog.us-east-1.prod.workshops.aws/workshops/80ba0ea5-7cf9-4b8c-9d3f-1cd988b6c071/en-US/zzz-legacy/1-use-cases/7-supply-chain#loading-the-dataset" target="_blank" rel="noopener"&gt;Workshop Studio&lt;/a&gt; (the file is named &lt;code&gt;ShippingLogs.csv&lt;/code&gt; on the linked page).&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Setting up the notebook&lt;/h2&gt; 
&lt;p&gt;Start by creating a new notebook in your SageMaker Unified Studio project. If you haven’t already, upload the &lt;code&gt;ShippingLogs.csv&lt;/code&gt; file under the Shared tab in the Files panel.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-1.jpg" alt="SageMaker Unified Studio Notebook Files panel showing the Shared tab with the ShippingLogs.csv dataset uploaded" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;In the first cell, we load and explore the dataset. To reference the file in code, select the file in the Shared tab and copy the Amazon Simple Storage Service (Amazon S3) URI shown in the file details. Alternatively, you can reference it with this code:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;import pandas as pd
from sagemaker_studio import Project

# Initialize the project
proj = Project()

# Get the S3 root path
s3_root = proj.s3.root

df = pd.read_csv(s3_root + '/ShippingLogs.csv')
df.head()&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The dataset contains columns including &lt;code&gt;Carrier&lt;/code&gt;, &lt;code&gt;ActualShippingDays&lt;/code&gt;, &lt;code&gt;ExpectedShippingDays&lt;/code&gt;, &lt;code&gt;ShippingOrigin&lt;/code&gt;, &lt;code&gt;ShippingPriority&lt;/code&gt;, and &lt;code&gt;OnTimeDelivery&lt;/code&gt;. Add a second cell to analyze shipping performance for a single carrier:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;import matplotlib.pyplot as plt

# Copy the filtered rows so that adding a column doesn't modify a view of df
carrier_data = df[df['Carrier'] == 'GlobalFreight'].copy()
# Flag late deliveries
carrier_data['is_late'] = carrier_data['ActualShippingDays'] &amp;gt; carrier_data['ExpectedShippingDays']
late_pct = carrier_data['is_late'].mean() * 100
# Visualize actual vs expected shipping days
plt.figure(figsize=(12, 4))
plt.hist(carrier_data['ActualShippingDays'] - carrier_data['ExpectedShippingDays'], bins=20, edgecolor='black')
plt.axvline(x=0, color='red', linestyle='--', label='On time')
plt.title(f'Shipping Delay Distribution - GlobalFreight ({late_pct:.1f}% late)')
plt.xlabel('Days Over Expected')
plt.ylabel('Number of Shipments')
plt.legend()
plt.show()&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;With the notebook working interactively, you’re ready to automate it.&lt;/p&gt; 
&lt;h2&gt;Running a notebook asynchronously&lt;/h2&gt; 
&lt;p&gt;To trigger an asynchronous run, open your notebook. In the notebook header, choose the menu on the &lt;strong&gt;Run all&lt;/strong&gt; button, and then choose &lt;strong&gt;Run in background&lt;/strong&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-2.jpg" alt="Notebook header with the Run all menu expanded, showing the Run in background option" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;This captures a snapshot of the notebook in its current state and starts a run on separate, dedicated compute. You can continue working on other tasks or close the browser entirely. Your interactive session isn’t affected.&lt;/p&gt; 
&lt;p&gt;You will see a notification at the bottom of your screen confirming that the run started. To check the status of your run, choose &lt;strong&gt;View Run&lt;/strong&gt; in the notification. This opens a view showing every background and scheduled run with its status, duration, and a link to view the full output.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-3.jpg" alt="Run history view showing background and scheduled runs with status, duration, and output links" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;You can open the run details at any point to see results as cells run. The run details include three tabs:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Output&lt;/strong&gt;: The notebook in read-only mode with cell results rendered, including dataframe outputs, visualizations, and print statements.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Parameters&lt;/strong&gt;: The parameter values used for this run.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Logs&lt;/strong&gt;: Run logs for debugging.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-4.jpg" alt="Run details view showing the Output, Parameters, and Logs tabs with rendered cell output" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;You can also access past runs by selecting the &lt;strong&gt;View Runs&lt;/strong&gt; option in the notebook header.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-5.jpg" alt="Notebook header with the View Runs option highlighted" width="600"&gt;&lt;/p&gt; 
&lt;h3&gt;Stopping an in-progress run&lt;/h3&gt; 
&lt;p&gt;If you need to cancel a run, open the run, and choose &lt;strong&gt;Stop&lt;/strong&gt;. The run terminates, and its status updates to reflect the cancellation.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-6.jpg" alt="Run detail view with the Stop button selected to terminate an in-progress run" width="600"&gt;&lt;/p&gt; 
&lt;h3&gt;What to know about background runs&lt;/h3&gt; 
&lt;p&gt;&lt;strong&gt;Compute:&lt;/strong&gt; Each background run uses its own dedicated compute, separate from your interactive session. Your interactive work isn’t interrupted.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Packages:&lt;/strong&gt; Packages that you install through the notebook’s package manager are available in your background runs. If you use &lt;code&gt;!pip install&lt;/code&gt; in code cells, the background run installs those packages as well.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Local files:&lt;/strong&gt; Background runs can’t access files stored locally in your notebook environment. Reference data from your project’s shared storage (Amazon S3) or connected data sources instead.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Startup time:&lt;/strong&gt; Expect a few minutes of startup time while compute is provisioned and your environment is prepared.&lt;/p&gt; 
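&lt;p&gt;Because background runs can’t read local files, it can help to check a notebook’s input paths before starting a run. The following is a minimal sketch of such a check. The helper names &lt;code&gt;is_s3_uri&lt;/code&gt; and &lt;code&gt;local_paths&lt;/code&gt; and the example bucket name are hypothetical, and the check covers only file paths, not connected data sources.&lt;/p&gt;

```python
from urllib.parse import urlparse


def is_s3_uri(path: str) -> bool:
    """Return True when the path looks like s3://bucket/key."""
    parsed = urlparse(path)
    return parsed.scheme == "s3" and bool(parsed.netloc)


def local_paths(paths):
    """Return the paths that are not S3 URIs and so would not be
    reachable from a background run."""
    return [p for p in paths if not is_s3_uri(p)]


# Hypothetical inputs: one S3 URI and one local relative path.
inputs = [
    "s3://example-project-bucket/shared/ShippingLogs.csv",
    "data/ShippingLogs.csv",
]
print(local_paths(inputs))
```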
&lt;h2&gt;Creating a recurring schedule&lt;/h2&gt; 
&lt;p&gt;Now that you’ve confirmed asynchronous runs work correctly, you can automate the notebook on a schedule. Choose the schedule icon in the notebook header to open the schedule creation form.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-7.jpg" alt="Schedule creation form opened from the notebook header schedule icon" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;Configure the following settings:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Schedule name&lt;/strong&gt;: Enter a descriptive name, such as Daily Shipping Report.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Schedule type&lt;/strong&gt;: Choose &lt;strong&gt;Recurring&lt;/strong&gt; for repeated runs or &lt;strong&gt;One-time&lt;/strong&gt; for a single future run.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Frequency&lt;/strong&gt;: Define how often the notebook runs using a rate (for example, every one day) or a cron expression. Set the time zone and the start and end dates for the schedule. For example, set the schedule to run every day at 7:00 AM UTC starting tomorrow.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Flexible time window&lt;/strong&gt; (optional): The number of minutes after the scheduled start time within which the run can be invoked. For example, with a 5-minute window, the notebook runs within 5 minutes of the start time.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Advanced settings:&lt;/strong&gt; 
  &lt;ul&gt; 
   &lt;li&gt;&lt;strong&gt;Compute Instance&lt;/strong&gt;: Keep the current settings or override with a different instance type for the asynchronous run to use.&lt;/li&gt; 
   &lt;li&gt;&lt;strong&gt;Timeout&lt;/strong&gt;: Set a maximum run duration to help prevent notebooks from running indefinitely. If left blank, it defaults to 60 minutes.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Choose &lt;strong&gt;Create&lt;/strong&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-8.jpg" alt="Configured schedule form with name, recurring type, daily frequency, and advanced settings populated" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;The schedule appears in the &lt;strong&gt;Schedules&lt;/strong&gt; tab of the activity panel. SageMaker Unified Studio creates an Amazon EventBridge Scheduler schedule for each schedule you configure.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-9.jpg" alt="Schedules tab in the activity panel listing the newly created Daily Shipping Report schedule" width="600"&gt;&lt;/p&gt; 
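&lt;p&gt;For reference, Amazon EventBridge Scheduler writes rate-based schedules as &lt;code&gt;rate(value unit)&lt;/code&gt; and cron-based schedules as &lt;code&gt;cron(minutes hours day-of-month month day-of-week year)&lt;/code&gt;. The following minimal sketch builds both forms for the daily example in this post. The helper names are hypothetical, and it’s an assumption that the schedule form accepts these exact strings. The time zone is configured separately from the expression.&lt;/p&gt;

```python
def rate_expression(value: int, unit: str) -> str:
    """Build a rate expression such as rate(1 day) or rate(5 minutes)."""
    if unit not in ("minute", "hour", "day"):
        raise ValueError(f"unsupported unit: {unit}")
    if not value > 0:
        raise ValueError("value must be a positive integer")
    # Use the singular unit for 1 and the plural unit otherwise.
    suffix = unit if value == 1 else unit + "s"
    return f"rate({value} {suffix})"


def daily_cron_expression(hour: int, minute: int = 0) -> str:
    """Build a cron expression that fires once a day at hour:minute.

    Field order: minutes hours day-of-month month day-of-week year.
    """
    if hour not in range(24) or minute not in range(60):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    return f"cron({minute} {hour} * * ? *)"


print(rate_expression(1, "day"))   # every one day
print(daily_cron_expression(7))    # every day at 7:00 in the schedule's time zone
```

&lt;p&gt;A rate expression counts from the schedule’s start time, while the cron form pins the run to a specific time of day, which is usually the better fit for a morning report.&lt;/p&gt;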
&lt;h3&gt;Viewing schedule run history&lt;/h3&gt; 
&lt;p&gt;To view past runs for a schedule, choose the schedule name in the Schedules activity panel. This opens the schedule details view, where you can see the list of runs triggered by that schedule, the duration of each run, and a link to open the notebook output for an individual run.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-10v2.png" alt="Schedule details view showing the list of past runs with status, duration, and output links" width="600"&gt;&lt;/p&gt; 
&lt;h3&gt;Editing and deleting schedules&lt;/h3&gt; 
&lt;p&gt;To modify a schedule, choose &lt;strong&gt;Edit&lt;/strong&gt; next to it in the Schedules panel. You can change the frequency, instance type, timeout, and other configuration fields. To pause or resume a schedule, choose &lt;strong&gt;Pause&lt;/strong&gt; or &lt;strong&gt;Resume&lt;/strong&gt; from the same menu. To remove a schedule, choose &lt;strong&gt;Delete&lt;/strong&gt; from that menu. Deleting a schedule stops future runs but preserves historical run outputs in Amazon S3 for auditing purposes.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-11.jpg" alt="Schedules panel with the Edit, Pause, Resume, and Delete options for a schedule" width="600"&gt;&lt;/p&gt; 
&lt;h2&gt;Parameterizing notebooks&lt;/h2&gt; 
&lt;p&gt;With parameters, you can reuse a single notebook across different inputs without duplicating code. For example, you can run the same shipping performance report for each carrier by passing a different carrier name to each run.&lt;/p&gt; 
&lt;h3&gt;Defining parameters&lt;/h3&gt; 
&lt;p&gt;Open the Parameters activity panel and choose &lt;strong&gt;Add&lt;/strong&gt;. Set the parameter name to &lt;code&gt;carrier&lt;/code&gt; and the default value to &lt;code&gt;GlobalFreight&lt;/code&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-12.jpg" alt="Parameters activity panel with the carrier parameter and GlobalFreight default value configured" width="600"&gt;&lt;/p&gt; 
&lt;h3&gt;Using parameters in code&lt;/h3&gt; 
&lt;p&gt;In your notebook, replace the second cell with the following code. This retrieves the &lt;code&gt;carrier&lt;/code&gt; parameter value using the SageMaker Unified Studio Python SDK instead of the hardcoded value:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;import sagemaker_studio
import matplotlib.pyplot as plt

carrier = sagemaker_studio.nbutils.parameters.get("carrier")

carrier_data = df[df['Carrier'] == carrier].copy()
carrier_data['is_late'] = carrier_data['ActualShippingDays'] &amp;gt; carrier_data['ExpectedShippingDays']
late_pct = carrier_data['is_late'].mean() * 100

plt.figure(figsize=(12, 4))
plt.hist(carrier_data['ActualShippingDays'] - carrier_data['ExpectedShippingDays'], bins=20, edgecolor='black')
plt.axvline(x=0, color='red', linestyle='--', label='On time')
plt.title(f'Shipping Delay Distribution - {carrier} ({late_pct:.1f}% late)')
plt.xlabel('Days Over Expected')
plt.ylabel('Number of Shipments')
plt.legend()
plt.show()&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3&gt;Creating schedules with different parameter values&lt;/h3&gt; 
&lt;p&gt;Now create three schedules for the same notebook, each targeting a different carrier:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;“daily-shipping-gf” with &lt;code&gt;carrier&lt;/code&gt; = &lt;code&gt;GlobalFreight&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;“daily-shipping-mc” with &lt;code&gt;carrier&lt;/code&gt; = &lt;code&gt;MicroCarrier&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;“daily-shipping-shipper” with &lt;code&gt;carrier&lt;/code&gt; = &lt;code&gt;Shipper&lt;/code&gt;.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;When you view a historical run, a separate &lt;strong&gt;Parameters&lt;/strong&gt; tab in the run output displays the parameter values that were active for that run.&lt;/p&gt; 
&lt;p&gt;You can also override parameter values when triggering an on-demand background run. Choose the menu on the &lt;strong&gt;Run all&lt;/strong&gt; button, then choose &lt;strong&gt;Run with settings&lt;/strong&gt;. You can keep the defaults or provide custom values for that run.&lt;/p&gt; 
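&lt;p&gt;One way to picture how a run resolves its parameter values is as a merge of the notebook defaults with the values supplied for that run. The following sketch is illustrative only: the &lt;code&gt;resolve_parameters&lt;/code&gt; helper and the dictionary layout are assumptions made for this example and aren’t part of the SageMaker Unified Studio SDK.&lt;/p&gt;

```python
# Illustrative sketch only: models parameter resolution with plain dictionaries.
# resolve_parameters is a hypothetical helper, not a SageMaker Unified Studio API.

# Default value defined in the Parameters panel of the notebook.
NOTEBOOK_DEFAULTS = {"carrier": "GlobalFreight"}

# Per-schedule values from the three schedules in this walkthrough.
SCHEDULE_PARAMETERS = {
    "daily-shipping-gf": {"carrier": "GlobalFreight"},
    "daily-shipping-mc": {"carrier": "MicroCarrier"},
    "daily-shipping-shipper": {"carrier": "Shipper"},
}


def resolve_parameters(defaults, overrides=None):
    """Return a copy of the defaults updated with any run-specific overrides."""
    resolved = dict(defaults)
    resolved.update(overrides or {})
    return resolved


for schedule_name, overrides in SCHEDULE_PARAMETERS.items():
    print(schedule_name, resolve_parameters(NOTEBOOK_DEFAULTS, overrides))
```

&lt;p&gt;In this model, a run that supplies no values falls back to the defaults, which matches keeping the defaults in &lt;strong&gt;Run with settings&lt;/strong&gt;.&lt;/p&gt;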
&lt;h2&gt;Orchestrating with Workflows&lt;/h2&gt; 
&lt;p&gt;To combine notebooks into a multi-step pipeline, such as running a data calculation notebook before the shipping log notebook, you can use the &lt;strong&gt;Notebook Operator&lt;/strong&gt; in the Workflows tool to orchestrate them.&lt;/p&gt; 
&lt;p&gt;To do this, choose the &lt;strong&gt;Add to workflows&lt;/strong&gt; button under the options menu of the notebook header.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-13.jpg" alt="Notebook header options menu with the Add to workflows button highlighted" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;This opens the Workflows tool and adds a new Notebook Operator task with properties prefilled from your notebook. When configuring the Operator task:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Select the target notebook from the notebook menu.&lt;/li&gt; 
 &lt;li&gt;Use the Parameters widget to pass notebook parameters into the run of the notebook.&lt;/li&gt; 
 &lt;li&gt;Specify optional arguments such as the compute instance and timeout configuration for the run.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-14.jpg" alt="Workflows canvas with a Notebook Operator task configured with notebook, parameters, and compute settings" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;Workflows also supports polling the status of a specific notebook run using the &lt;strong&gt;Notebook Sensor&lt;/strong&gt;. To add a new Sensor task, hover over the edge of the existing Operator task, where a &lt;strong&gt;plus (+)&lt;/strong&gt; button is displayed.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-15.jpg" alt="Workflows canvas showing the plus button on the edge of an Operator task for adding a Sensor" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;You can then search for and add the Notebook Sensor to the canvas.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-16.jpg" alt="Task picker dialog with Notebook Sensor selected for adding to the workflow canvas" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;When configuring the Sensor task, select the target notebook from the notebook menu and specify the notebook run ID in the text field. The Operator’s form field contains Jinja templating to retrieve the notebook run. If the Sensor is in the same workflow as the Operator, you can copy this template into the Sensor to poll that notebook run.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-17.jpg" alt="Notebook Sensor configuration panel with the notebook run ID field populated using Jinja templating" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;Within Workflows, you can configure notebook runs to emit outputs and use those outputs as inputs for subsequent notebook runs.&lt;/p&gt; 
&lt;p&gt;Building on the previous shipping log notebook example, we pass the &lt;code&gt;carrier&lt;/code&gt; parameter from an upstream notebook’s output. Your &lt;code&gt;shipping-logs-analysis&lt;/code&gt; notebook should already be set up.&lt;/p&gt; 
&lt;p&gt;Because the notebook depends on the &lt;code&gt;carrier&lt;/code&gt; parameter, you can specify it in the Parameters panel.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-18.jpg" alt="Parameters panel for the shipping-logs-analysis Operator with the carrier parameter dependency configured" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;Now, define a second notebook, &lt;code&gt;calculate-best-carrier&lt;/code&gt;, which performs a calculation to determine our best carrier to use for shipping:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;import pandas as pd
from sagemaker_studio import Project

# Initialize the project
proj = Project()

# Get the S3 root path
s3_root = proj.s3.root

df = pd.read_csv(s3_root + '/ShippingLogs.csv')
df.head()

carrier_stats = df.groupby('Carrier').agg(
    total=('OrderID', 'count'),
    late=('OnTimeDelivery', lambda x: (x == 'Late').sum())
).reset_index()
carrier_stats['late_pct'] = carrier_stats['late'] / carrier_stats['total'] * 100

best = carrier_stats.sort_values('late_pct', ascending=True).iloc[0]
best_carrier = best['Carrier']

print("Late % by carrier:")
print(carrier_stats.to_string(index=False))
print(f"\nBest carrier: {best_carrier} ({best['late_pct']:.1f}% late)")&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;To configure the &lt;code&gt;calculate-best-carrier&lt;/code&gt; notebook’s outputs, open the Variables panel. A selector at the bottom of this panel lets you choose which variables to mark as outputs.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-19.jpg" alt="Variables panel with the selector at the bottom for marking notebook variables as outputs" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;We want this notebook to emit the &lt;code&gt;best_carrier&lt;/code&gt; variable.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-20.jpg" alt="Variables panel showing best_carrier marked as an output variable for the calculate-best-carrier notebook" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;Now, use the &lt;strong&gt;Add to workflows&lt;/strong&gt; button as previously demonstrated to add this notebook to a workflow. Then chain a second Notebook Operator that points to the &lt;code&gt;shipping-logs-analysis&lt;/code&gt; notebook. Because we specified a parameter dependency on &lt;code&gt;carrier&lt;/code&gt; for this notebook, it’s available as an option in the Parameters widget menu.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-21.jpg" alt="Parameters widget menu of a Notebook Operator showing carrier as a configurable parameter dependency" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;When notebook tasks are chained, each task detects the outputs set in upstream notebook runs. You can select these outputs in the Parameters widget of the Operator to pass them into the run, and you can repeat this for any number of chained Operator tasks. In this example, select the &lt;code&gt;best_carrier&lt;/code&gt; output emitted by the &lt;code&gt;calculate-best-carrier&lt;/code&gt; notebook to supply the &lt;code&gt;carrier&lt;/code&gt; parameter.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-22.jpg" alt="Parameters widget displaying best_carrier as a selectable upstream output to pass into the next Operator" width="600"&gt;&lt;/p&gt; 
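&lt;p&gt;Conceptually, chaining the two notebooks behaves like calling one function with the result of another. The following sketch stands in plain Python functions for the two notebooks. The function names and the sample records are made up for illustration; in Workflows, the Parameters widget performs the mapping.&lt;/p&gt;

```python
# Illustrative sketch only: plain functions stand in for the two notebooks.
# The sample records are made up for this example and are not the real dataset.

SAMPLE_SHIPMENTS = [
    {"Carrier": "GlobalFreight", "OnTimeDelivery": "Late"},
    {"Carrier": "GlobalFreight", "OnTimeDelivery": "On Time"},
    {"Carrier": "MicroCarrier", "OnTimeDelivery": "On Time"},
    {"Carrier": "MicroCarrier", "OnTimeDelivery": "On Time"},
    {"Carrier": "Shipper", "OnTimeDelivery": "Late"},
    {"Carrier": "Shipper", "OnTimeDelivery": "Late"},
]


def calculate_best_carrier(shipments):
    """Upstream step: emit the carrier with the lowest late percentage."""
    totals, late = {}, {}
    for row in shipments:
        carrier = row["Carrier"]
        totals[carrier] = totals.get(carrier, 0) + 1
        late[carrier] = late.get(carrier, 0) + (row["OnTimeDelivery"] == "Late")
    late_pct = {name: late[name] / totals[name] * 100 for name in totals}
    best = min(late_pct, key=late_pct.get)
    return {"best_carrier": best}  # the variable marked as an output


def shipping_logs_analysis(parameters):
    """Downstream step: read the carrier parameter, as the report notebook does."""
    return f"Report for {parameters['carrier']}"


outputs = calculate_best_carrier(SAMPLE_SHIPMENTS)
# Map the upstream output to the downstream parameter name.
report = shipping_logs_analysis({"carrier": outputs["best_carrier"]})
print(report)
```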
&lt;p&gt;You can now choose the &lt;strong&gt;Save&lt;/strong&gt; button at the top left of the visual canvas, then choose the &lt;strong&gt;Run&lt;/strong&gt; button to start the workflow. When the workflow completes, the specified notebook outputs are available in the Task Output panel, and you can view the notebook run result in the Notebooks tool.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-23.jpg" alt="Task Output panel showing the emitted notebook outputs after a successful workflow run" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-24.jpg" alt="Notebook run result rendered in the Notebooks tool after the chained workflow completes" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;Similarly, the Notebook Sensor emits the outputs of the notebook run that it polls, and other tasks can use those outputs. This is useful when you want to retrieve outputs from a notebook run in another workflow.&lt;/p&gt; 
&lt;h2&gt;Debugging a failed run with AI assistance&lt;/h2&gt; 
&lt;p&gt;When viewing your past runs, you notice that a run from earlier today has a Failed status. Choose the failed run to open the notebook output in read-only mode.&lt;/p&gt; 
&lt;p&gt;In this example, suppose you incorrectly referred to column name &lt;code&gt;ActualShippingDays&lt;/code&gt; as &lt;code&gt;DeliveryDays&lt;/code&gt;. The run would fail with a &lt;code&gt;KeyError: 'DeliveryDays'&lt;/code&gt; in the cell that computes late deliveries.&lt;/p&gt; 
&lt;p&gt;At the top of the failed run output, choose &lt;strong&gt;Troubleshoot with AI&lt;/strong&gt;. This opens the notebook with the Agent chat panel displayed.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-25.jpg" alt="Failed run output with the Troubleshoot with AI button highlighted at the top of the page" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;The Data Agent analyzes the cell outputs, identifies the cell that errored, explains the root cause, and suggests a fix. In this case, it identifies that the column &lt;code&gt;DeliveryDays&lt;/code&gt; doesn’t exist in the dataframe and suggests updating the code reference. You can review the change, then verify the fix by choosing &lt;strong&gt;Run in background&lt;/strong&gt; from the &lt;strong&gt;Run all&lt;/strong&gt; menu to trigger a test run before the next scheduled run.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-91543 size-large" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-26-443x1024.jpg" alt="" width="443" height="1024"&gt;&lt;/p&gt; 
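&lt;p&gt;A column-name mistake like this one can also be surfaced earlier with a small validation step at the top of the notebook. The following is a minimal sketch; &lt;code&gt;require_columns&lt;/code&gt; is a hypothetical helper written for this example, not part of any SDK.&lt;/p&gt;

```python
# Illustrative sketch only: require_columns is a hypothetical helper.


def require_columns(available, required):
    """Raise a descriptive KeyError if any required column is missing."""
    missing = sorted(set(required) - set(available))
    if missing:
        raise KeyError(
            f"Missing columns {missing}; available columns: {sorted(available)}"
        )


# With pandas, you could pass df.columns as the first argument.
columns = ["Carrier", "ActualShippingDays", "ExpectedShippingDays"]

# Passes silently because both columns exist.
require_columns(columns, ["ActualShippingDays", "ExpectedShippingDays"])

try:
    require_columns(columns, ["DeliveryDays"])
except KeyError as err:
    print(f"Validation failed: {err}")
```

&lt;p&gt;Failing fast with the list of available columns makes the root cause visible in the first cell output instead of in a later calculation.&lt;/p&gt;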
&lt;p&gt;&lt;strong&gt;Note&lt;/strong&gt;: You can also use the Data Agent to create schedules and start notebook runs using natural language, without having to navigate through the interface.&lt;/p&gt; 
&lt;h2&gt;Cleaning up&lt;/h2&gt; 
&lt;p&gt;To avoid incurring future charges, delete the resources that you created in this walkthrough:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Delete any schedules that you created from the Schedules panel in your notebook.&lt;/li&gt; 
 &lt;li&gt;Delete test notebooks if you don’t need them.&lt;/li&gt; 
 &lt;li&gt;Navigate to the Workflows page and delete any workflows that you created during this walkthrough.&lt;/li&gt; 
 &lt;li&gt;Delete historical run outputs from your project’s Amazon S3 storage if you no longer need them. They are retained until you manually remove them.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, we showed how to run notebooks in the background in Amazon SageMaker Unified Studio using background runs, schedules, parameterization, workflow orchestration, and AI-assisted debugging. Using a shipping logistics dataset, we demonstrated how a single notebook can be parameterized to generate performance reports for different carriers on independent schedules, all without duplicating code or managing extensive infrastructure.&lt;/p&gt; 
&lt;p&gt;To get started, open a notebook in your SageMaker Unified Studio project, choose the menu on the &lt;strong&gt;Run all&lt;/strong&gt; button in the notebook header, and choose &lt;strong&gt;Run in background&lt;/strong&gt;. For more advanced use cases, explore workflows in Amazon SageMaker Unified Studio to build multi-step data pipelines, or review the Amazon SageMaker Unified Studio User Guide for additional configuration options.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Learn more:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/navigating-sagemaker-unified-studio.html" target="_blank" rel="noopener"&gt;Amazon SageMaker Unified Studio User Guide&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/notebooks.html" target="_blank" rel="noopener"&gt;Notebooks in SageMaker Unified Studio&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/manage-compute-environments.html" target="_blank" rel="noopener"&gt;Manage compute environments&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/sagemaker/pricing/" target="_blank" rel="noopener"&gt;Amazon SageMaker Unified Studio pricing&lt;/a&gt;.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;If you have feedback or questions, reach out on &lt;a href="https://repost.aws/tags/TAdXqriMJIT6CL4ervYlUgow/amazon-sagemaker-unified-studio" target="_blank" rel="noopener"&gt;AWS re:Post for Amazon SageMaker Unified Studio&lt;/a&gt;.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-27.jpg" alt="Shivani Mehendarge" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Shivani Mehendarge&lt;/h3&gt; 
  &lt;p&gt;Shivani is a Software Development Engineer at Amazon Web Services, where she builds scalable infrastructure that helps data teams run and automate their workloads in Amazon SageMaker Unified Studio. She is passionate about solving complex distributed systems challenges and building reliable cloud services.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-28.jpg" alt="Regan Perk" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Regan Perk&lt;/h3&gt; 
  &lt;p&gt;Regan is a Senior Software Development Engineer on the Amazon SageMaker Unified Studio team. She designs, implements, and maintains features that enable customers to manage schedules and workflows in SageMaker Unified Studio.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-6003-29.jpg" alt="Qazi Ashikin" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Qazi Ashikin&lt;/h3&gt; 
  &lt;p&gt;Qazi is a Software Development Engineer at Amazon Web Services, where he works on developing features that allow customers to orchestrate workflows and schedules in SageMaker Unified Studio. He also works on AWS Glue Studio, where he builds agentic systems and maintains services that enable data analytics.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>Amazon OpenSearch Service: Mechanisms to secure your domain</title>
		<link>https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-mechanisms-to-secure-your-domain/</link>
		
		<dc:creator><![CDATA[Imtiaz Sayed]]></dc:creator>
		<pubDate>Wed, 03 Jun 2026 15:21:53 +0000</pubDate>
				<category><![CDATA[Amazon OpenSearch Service]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<guid isPermaLink="false">d949974511d4fbfbe79dcca360fae9bee15f8c55</guid>

					<description>This post offers an overview of the security mechanisms available for Amazon OpenSearch Service, spanning authentication and authorization, encryption, and network access controls. You learn how to implement fine-grained access control, manage AWS Identity and Access Management (IAM) roles, and secure data both in transit and at rest for both public and virtual private cloud (VPC) access domains.</description>
										<content:encoded>&lt;p&gt;Imagine you’re building a product search feature for your website or storing customer records in Amazon OpenSearch Service to power full-text search. The moment that real user data enters your domain, security becomes essential.&lt;/p&gt; 
&lt;p&gt;Whether your workload is a public-facing website search, an internal application querying sensitive data, or a pipeline handling personally identifiable information (PII), the questions you face are the same:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Who should be allowed to connect to my domain?&lt;/li&gt; 
 &lt;li&gt;How do I authenticate users and services?&lt;/li&gt; 
 &lt;li&gt;How do I make sure that even authenticated users only see data they are entitled to see?&lt;/li&gt; 
 &lt;li&gt;How do I satisfy regulatory requirements such as HIPAA, PCI DSS, or SOC 2?&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;This post offers an overview of the security mechanisms available for &lt;a href="https://aws.amazon.com/opensearch-service/" target="_blank" rel="noopener"&gt;Amazon OpenSearch Service&lt;/a&gt;, spanning authentication and authorization, encryption, and network access controls. You learn how to implement fine-grained access control, manage &lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener"&gt;AWS Identity and Access Management (IAM)&lt;/a&gt; roles, and secure data both in transit and at rest for both public and virtual private cloud (VPC) access domains.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Scope:&lt;/strong&gt; This post covers security for Amazon OpenSearch Service managed clusters only. It doesn’t cover Amazon OpenSearch Serverless, which uses a different security model. For serverless security, see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/security.html" target="_blank" rel="noopener"&gt;Amazon OpenSearch Serverless security&lt;/a&gt; in the AWS documentation.&lt;/p&gt; 
&lt;p&gt;To begin, let’s look at the security layers in Amazon OpenSearch Service.&lt;/p&gt; 
&lt;h2&gt;Amazon OpenSearch Service security layers&lt;/h2&gt; 
&lt;p&gt;Amazon OpenSearch Service secures a domain in multiple layers, as the following diagram illustrates.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/28/BDB-5059-1.png" alt="Diagram showing the three security layers of Amazon OpenSearch Service: Network, Domain access policy, and Fine-grained access control" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 1: Multi-layer security.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;The three main layers of security are network, domain access policy, and fine-grained access control.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Network&lt;/strong&gt; – The first security layer is the network, which determines whether requests reach an OpenSearch Service domain. If you choose &lt;strong&gt;Public access&lt;/strong&gt; when you create a domain, requests from any internet-connected client can reach the domain endpoint. If you choose &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html" target="_blank" rel="noopener"&gt;VPC access&lt;/a&gt;, clients must connect through the Amazon Virtual Private Cloud (Amazon VPC), and the associated security groups must permit the traffic, for a request to reach the endpoint.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Domain access policy&lt;/strong&gt; – The second security layer is the domain access policy. After a request reaches a domain endpoint, the &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ac.html#ac-types-resource" target="_blank" rel="noopener"&gt;resource-based access policy&lt;/a&gt; allows or denies the request access to a given URI. The access policy accepts or rejects requests at the edge of the domain, before they reach data or indexes in OpenSearch itself.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Fine-grained access control&lt;/strong&gt; – The third and final security layer is &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/fgac.html" target="_blank" rel="noopener"&gt;fine-grained access control&lt;/a&gt;. After a resource-based access policy allows a request to reach a domain endpoint, fine-grained access control evaluates the user credentials and either authenticates the user or denies the request. If fine-grained access control authenticates the user, it fetches all OpenSearch internal roles mapped to that user and uses the full set of permissions to determine how to handle the request.&lt;/p&gt; 
&lt;p&gt;With fine-grained access control, you can control access to your data in Amazon OpenSearch Service. For example, depending on who makes the request, you might want to hide certain fields in your documents or exclude certain documents altogether. With fine-grained access control, you can:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Define role-based access control to determine who can perform which actions on which indexes, documents, and fields.&lt;/li&gt; 
 &lt;li&gt;Define security at the index, document, and field level to allow access to only required data.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Fine-grained access control requires OpenSearch or &lt;a href="https://aws.amazon.com/what-is/elasticsearch/" target="_blank" rel="noopener"&gt;Elasticsearch&lt;/a&gt; 6.7 or later. It also requires HTTPS for all traffic to the domain, &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/encryption-at-rest.html" target="_blank" rel="noopener"&gt;encryption of data at rest&lt;/a&gt;, and &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ntn.html" target="_blank" rel="noopener"&gt;node-to-node encryption&lt;/a&gt;. Depending on how you configure the advanced features of fine-grained access control, the additional processing of your requests might require compute and memory resources on individual data nodes. After you turn on fine-grained access control, you can’t turn it off. For more details, see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/fgac.html" target="_blank" rel="noopener"&gt;Fine-grained access control in Amazon OpenSearch Service&lt;/a&gt; in the AWS documentation.&lt;/p&gt; 
&lt;p&gt;To learn more about security features in an OpenSearch Service domain, let’s start by configuring a new public access domain. We discuss a VPC access domain later in the post.&lt;/p&gt; 
&lt;h2&gt;Public access domain&lt;/h2&gt; 
&lt;p&gt;With a public access domain, you can configure an OpenSearch Service domain so that the domain endpoint is accessible from the internet.&lt;/p&gt; 
&lt;p&gt;The AWS Management Console for Amazon OpenSearch Service provides a guided wizard that you can use to configure and reconfigure your provisioned Amazon OpenSearch Service domains. Follow the &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/fgac-http-auth.html" target="_blank" rel="noopener"&gt;Tutorial: Configure a domain with the internal user database and HTTP basic authentication&lt;/a&gt; in the AWS documentation to configure a domain with basic authentication and validate fine-grained access control.&lt;/p&gt; 
&lt;p&gt;Let’s review some important configuration attributes for a public access domain.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Network:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Choose &lt;strong&gt;Public access&lt;/strong&gt; to simplify the network access configuration. For production workloads, we recommend &lt;strong&gt;VPC access&lt;/strong&gt;.&lt;/p&gt; 
&lt;p&gt;With the domain in public access, you have several options to secure access. While you can use a resource-based access policy to restrict access to specific IAM principals or IP addresses, the recommended approach is to turn on &lt;strong&gt;fine-grained access control (FGAC)&lt;/strong&gt; and use it as the primary mechanism for securing your domain. With FGAC turned on, you can set an open access policy (allowing all traffic to reach the domain) and let FGAC handle authentication and authorization at the index, document, and field level.&lt;/p&gt; 
&lt;p&gt;When using IAM-based authentication with FGAC, you should map IAM roles to &lt;strong&gt;backend roles&lt;/strong&gt; in OpenSearch. You can use backend roles to assign permissions to groups of users based on their IAM role, rather than managing individual user mappings. This is especially important because if your IAM federation or authentication mechanism changes, the backend role mappings keep access control within OpenSearch consistent.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/28/BDB-5059-2.png" alt="Amazon OpenSearch Service network configuration screen with Public access selected" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 2: Use public access domain.&lt;/em&gt;&lt;/p&gt; 
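&lt;p&gt;As a rough illustration of the open access policy described earlier, the following sketch builds the policy document as a Python dictionary and prints it as JSON. The Region, account ID, and domain name are placeholders chosen for this example.&lt;/p&gt;

```python
import json

# Illustrative sketch only: the Region, account ID, and domain name are placeholders.
REGION = "us-east-1"
ACCOUNT_ID = "123456789012"
DOMAIN_NAME = "my-domain"

# Open resource-based policy: any principal may send HTTP requests to the domain.
# This is only reasonable when fine-grained access control is turned on, because
# FGAC then performs authentication and authorization.
open_access_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "*"},
            "Action": "es:ESHttp*",
            "Resource": f"arn:aws:es:{REGION}:{ACCOUNT_ID}:domain/{DOMAIN_NAME}/*",
        }
    ],
}

print(json.dumps(open_access_policy, indent=2))
```

&lt;p&gt;To restrict the domain at this layer instead, you would replace the wildcard principal with specific IAM principals or add a condition on source IP addresses.&lt;/p&gt;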
&lt;p&gt;&lt;strong&gt;Fine-grained access control:&lt;/strong&gt; Fine-grained access control provides numerous features to help you keep your data secure, such as document-level security, field-level security, read-only users, and OpenSearch Dashboards/Kibana tenants. Fine-grained access control requires a primary user, which we refer to throughout the rest of this post.&lt;/p&gt; 
&lt;p&gt;The primary user is the administrator identity for your OpenSearch domain. This user can set up additional users in Amazon OpenSearch Service, assign roles to them, and assign permissions for those roles. You can choose username and password authentication for the primary user or use an IAM identity. You use these credentials to log in to OpenSearch Dashboards. Following the best practices on choosing your primary user, you should move to an IAM primary user for production workloads.&lt;/p&gt; 
&lt;p&gt;Fine-grained access control can be applied regardless of how you log in. You can follow your organization’s suggested authentication mechanism and apply fine-grained access control on top of it.&lt;/p&gt; 
&lt;p&gt;FGAC provides security at multiple levels to meet your security needs:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Index-level security – Controls who can create, search, read, write, update, or delete within specific indexes.&lt;/li&gt; 
 &lt;li&gt;Document-level security – Restricts which documents within an index a user can see, using OpenSearch query filters (for example, only show documents where department: “sales”).&lt;/li&gt; 
 &lt;li&gt;Field-level security – Controls which fields within documents are visible (include or exclude specific fields).&lt;/li&gt; 
 &lt;li&gt;Field masking – Anonymizes sensitive field data (for example, hash a release_date or SSN field) rather than hiding it entirely.&lt;/li&gt; 
&lt;/ul&gt; 
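&lt;p&gt;To make these levels concrete, the following sketch builds a role definition and a role mapping in the format used by the OpenSearch Security plugin REST API. It only constructs and prints the request bodies. The role name, index pattern, field names, and IAM role ARN are hypothetical values chosen for this example.&lt;/p&gt;

```python
import json

# Illustrative sketch only: the role name, index pattern, field names, and IAM
# role ARN are hypothetical. The body keys follow the OpenSearch Security plugin
# role format (index_permissions, dls, fls, masked_fields).
ROLE_NAME = "sales_reader"

role_body = {
    "index_permissions": [
        {
            # Index-level security: which indexes and which actions.
            "index_patterns": ["orders-*"],
            "allowed_actions": ["read"],
            # Document-level security: the query is passed as a JSON string.
            "dls": json.dumps({"term": {"department": "sales"}}),
            # Field-level security: a leading tilde excludes the field.
            "fls": ["~internal_notes"],
            # Field masking: the value is returned as a hash.
            "masked_fields": ["ssn"],
        }
    ]
}

# Map an IAM role (as a backend role) to the OpenSearch role.
role_mapping_body = {
    "backend_roles": ["arn:aws:iam::123456789012:role/LimitedUserRole"]
}

requests_to_send = [
    ("PUT", f"_plugins/_security/api/roles/{ROLE_NAME}", role_body),
    ("PUT", f"_plugins/_security/api/rolesmapping/{ROLE_NAME}", role_mapping_body),
]
for method, path, body in requests_to_send:
    print(method, path)
    print(json.dumps(body, indent=2))
```

&lt;p&gt;You could send these requests from OpenSearch Dashboards Dev Tools as the primary user, or configure the same role in the Dashboards security pages.&lt;/p&gt;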
&lt;p&gt;Fine-grained access control supports several authentication mechanisms, including HTTP basic authentication using an internal user database, Amazon Cognito for web-based Dashboards access, SAML for enterprise identity provider integration, JSON Web Tokens (JWT) for token-based authentication, and AWS Identity and Access Management with SigV4 signing for IAM users and roles.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Encryption:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Amazon OpenSearch Service encrypts data both in transit and at rest. When you turn on fine-grained access control, encryption is required—the corresponding settings are automatically turned on and can’t be changed. These include &lt;a href="https://en.wikipedia.org/wiki/Transport_Layer_Security" target="_blank" rel="noopener"&gt;Transport Layer Security (TLS 1.2 or later)&lt;/a&gt; for requests to the domain and for traffic between nodes in the domain, and encryption of data at rest through &lt;a href="https://aws.amazon.com/kms/" target="_blank" rel="noopener"&gt;AWS Key Management Service (AWS KMS)&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;For encryption at rest, OpenSearch Service supports three key types: &lt;em&gt;AWS owned keys&lt;/em&gt;, &lt;em&gt;AWS managed keys&lt;/em&gt;, and &lt;em&gt;customer managed keys&lt;/em&gt;. While AWS owned keys provide a quick-start option with no additional configuration, &lt;em&gt;customer managed keys are the recommended best practice&lt;/em&gt;. Customer managed keys give you full control over the encryption key lifecycle, including key rotation policies, granular access control through key policies, and the ability to audit key usage through AWS CloudTrail. To use a customer managed key, create a symmetric encryption key in AWS KMS and select it when configuring your domain’s encryption settings.&lt;/p&gt; 
&lt;p&gt;For a &lt;em&gt;basic public access domain&lt;/em&gt; with FGAC, all traffic reaches the domain freely (no VPC restriction), and an open access policy is used so no SigV4 signing is needed. FGAC then takes over, authenticating users through the internal user database (username/password) and enforcing role-based permissions at the index, document, and field level.&lt;/p&gt; 
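&lt;p&gt;The encryption and fine-grained access control settings described here correspond to parameters of the &lt;code&gt;create_domain&lt;/code&gt; call in the AWS SDK for Python (Boto3). The following sketch only assembles the arguments and doesn’t call AWS. The domain name, user name, KMS key ARN, and the &lt;code&gt;MASTER_USER_PASSWORD&lt;/code&gt; environment variable are placeholders assumed for this example.&lt;/p&gt;

```python
import os

# Illustrative sketch only: the domain name, user name, and KMS key ARN are
# placeholders. The keys mirror parameters of the Boto3 OpenSearch client's
# create_domain call.
KMS_KEY_ARN = "arn:aws:kms:us-east-1:123456789012:key/EXAMPLE-KEY-ID"

create_domain_kwargs = {
    "DomainName": "my-public-domain",
    # Encryption in transit: require HTTPS with TLS 1.2 as the minimum version.
    "DomainEndpointOptions": {
        "EnforceHTTPS": True,
        "TLSSecurityPolicy": "Policy-Min-TLS-1-2-2019-07",
    },
    # Encryption of traffic between nodes in the domain.
    "NodeToNodeEncryptionOptions": {"Enabled": True},
    # Encryption at rest with a customer managed key.
    "EncryptionAtRestOptions": {"Enabled": True, "KmsKeyId": KMS_KEY_ARN},
    # Fine-grained access control with the internal user database.
    "AdvancedSecurityOptions": {
        "Enabled": True,
        "InternalUserDatabaseEnabled": True,
        "MasterUserOptions": {
            "MasterUserName": "admin-user",
            # Read the password from the environment instead of hardcoding it.
            "MasterUserPassword": os.environ.get("MASTER_USER_PASSWORD", "REPLACE_ME"),
        },
    },
}

# To create the domain (requires boto3 and AWS credentials):
# import boto3
# boto3.client("opensearch").create_domain(**create_domain_kwargs)
print(sorted(create_domain_kwargs))
```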
&lt;p&gt;The public access configuration we discussed is useful for development and testing, but for &lt;em&gt;production workloads&lt;/em&gt;, a best practices deployment combines VPC access, IAM-based authentication, and fine-grained access control. This approach layers all three security mechanisms—network isolation, identity verification, and granular permissions—to protect your domain end to end.&lt;/p&gt; 
&lt;h2&gt;VPC access domain&lt;/h2&gt; 
&lt;p&gt;Placing your OpenSearch Service domain inside a VPC restricts network-level access to resources within the VPC or connected networks. Traffic between your applications and the OpenSearch endpoint doesn’t traverse the public internet, and you can use security groups to further limit which entities can communicate with the domain. OpenSearch Service places a VPC endpoint (VPCe) using AWS PrivateLink into one, two, or three subnets of your VPC depending on your Availability Zone configuration. For high availability (HA), turn on &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managedomains-multiaz.html" target="_blank" rel="noopener"&gt;multiple Availability Zones&lt;/a&gt; with each subnet in a different zone within the same AWS Region. For more details, see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/vpc.html" target="_blank" rel="noopener"&gt;Launching your Amazon OpenSearch Service domains within a VPC&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;For this best practices deployment, we use an &lt;em&gt;IAM primary user&lt;/em&gt; with &lt;em&gt;Amazon Cognito authentication&lt;/em&gt; for OpenSearch Dashboards and for fine-grained access control. We configure a primary IAM role and a limited IAM role, associate them with users in Amazon Cognito through a user pool and identity pool, and then use fine-grained access control to manage permissions. The primary user can then sign in to OpenSearch Dashboards, create backend roles, map the limited user to a restricted role, and enforce granular access at the index, document, and field level. For more details, see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/fgac-iam.html" target="_blank" rel="noopener"&gt;Tutorial: Configure a domain with an IAM master user and Amazon Cognito authentication&lt;/a&gt; in the AWS documentation.&lt;/p&gt; 
&lt;p&gt;The following high-level steps detail what’s needed to configure a VPC access domain with Amazon Cognito users. These steps use the Amazon Cognito user pool for authentication. The same basic process works for any Cognito authentication provider that lets you assign different IAM roles to different users.&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Create an Amazon Cognito user pool.&lt;/li&gt; 
 &lt;li&gt;Add users in the user pool for the primary user and a limited-access user.&lt;/li&gt; 
 &lt;li&gt;Create an Amazon Cognito identity pool.&lt;/li&gt; 
 &lt;li&gt;Update the IAM role for the primary user to allow access to OpenSearch Dashboards.&lt;/li&gt; 
 &lt;li&gt;Create an IAM role for the limited user.&lt;/li&gt; 
 &lt;li&gt;Create the domain.&lt;/li&gt; 
&lt;/ul&gt; 
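&lt;p&gt;The Amazon Cognito steps above can be sketched with the AWS CLI as follows. The pool and user names are hypothetical, and the sketch leaves out the user pool domain, the IAM role changes, and the domain creation itself, which the linked tutorial covers.&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;# Hypothetical names; adjust to your environment.
# 1. Create the user pool and capture its ID.
USER_POOL_ID=$(aws cognito-idp create-user-pool \
  --pool-name opensearch-dashboards-users \
  --query 'UserPool.Id' --output text)

# 2. Add a primary user and a limited-access user.
aws cognito-idp admin-create-user --user-pool-id "$USER_POOL_ID" --username primary-user
aws cognito-idp admin-create-user --user-pool-id "$USER_POOL_ID" --username limited-user

# 3. Create the identity pool with unauthenticated identities turned off.
aws cognito-identity create-identity-pool \
  --identity-pool-name opensearch-dashboards-identities \
  --no-allow-unauthenticated-identities&lt;/code&gt;&lt;/pre&gt; 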
&lt;p&gt;You can follow &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/createupdatedomains.html" target="_blank" rel="noopener"&gt;Creating and managing Amazon OpenSearch Service domains&lt;/a&gt; in the AWS documentation to provision a domain. The following sections describe some important attributes for the domain.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Network:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Choose VPC access. Public access isn’t recommended for production workloads. Pick the VPC, subnets, and security group that you created for the OpenSearch Service domain.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/28/BDB-5059-3.png" alt="Amazon OpenSearch Service network configuration screen with VPC access selected and VPC, subnet, and security group fields populated" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 4: Use VPC access.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Fine-grained access control:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Turn on fine-grained access control with &lt;code&gt;OS[MasterUserRole]&lt;/code&gt; as the primary user. You can follow steps in &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/fgac-iam.html" target="_blank" rel="noopener"&gt;Tutorial: Configure a domain with an IAM master user and Amazon Cognito authentication&lt;/a&gt; to create &lt;code&gt;OS[MasterUserRole]&lt;/code&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/28/BDB-5059-4.png" alt="Amazon OpenSearch Service fine-grained access control configuration with an IAM ARN selected as the primary user" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 5: Turn on fine-grained access control with an IAM role.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;Fine-grained access control provides numerous features to help you keep your data secure, such as document-level security, field-level security, read-only users, and OpenSearch Dashboards/Kibana tenants. Fine-grained access control requires a primary user.&lt;/p&gt; 
&lt;p&gt;The primary user is the administrator identity for your OpenSearch domain. This user can set up additional users in Amazon OpenSearch Service, assign roles to them, and assign permissions for those roles. You can choose username and password authentication for the primary user or use an IAM identity. You use these credentials to log in to OpenSearch Dashboards. Following the best practices on choosing your primary user, you should choose an IAM primary user for production workloads.&lt;/p&gt; 
&lt;p&gt;Fine-grained access control can be applied regardless of how you log in. You can follow your organization’s suggested authentication mechanism and apply fine-grained access control on top of it.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Amazon Cognito authentication:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;To turn on Amazon Cognito authentication, select &lt;strong&gt;Enable Amazon Cognito authentication&lt;/strong&gt; and choose the Amazon Cognito user pool and Amazon Cognito identity pool for your OpenSearch Dashboards.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/28/BDB-5059-5.png" alt="Amazon OpenSearch Service authentication configuration with Amazon Cognito enabled and a user pool and identity pool selected" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 6: Turn on Amazon Cognito authentication.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Access policy:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;The access policy controls whether a request is accepted or rejected when it reaches the Amazon OpenSearch Service domain. You can configure a domain-level access policy to allow access to your Amazon OpenSearch Service domain.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/28/BDB-5059-6.png" alt="Amazon OpenSearch Service domain access policy editor showing a JSON policy granting access to the configured IAM principals" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 7: Configure domain-level access to the domain.&lt;/em&gt;&lt;/p&gt; 
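&lt;p&gt;As an illustration only, a domain-level access policy that admits two IAM roles could be applied with the AWS CLI as shown here. The account ID, Region, role names, and domain name are all hypothetical, and your policy might reasonably be broader or narrower than this one.&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;# Hypothetical account ID, Region, role names, and domain name.
aws opensearch update-domain-config \
  --domain-name my-domain \
  --access-policies '{
    "Version": "2012-10-17",
    "Statement": [
      {
        "Effect": "Allow",
        "Principal": {
          "AWS": [
            "arn:aws:iam::111122223333:role/primary-role",
            "arn:aws:iam::111122223333:role/limited-role"
          ]
        },
        "Action": "es:ESHttp*",
        "Resource": "arn:aws:es:us-east-1:111122223333:domain/my-domain/*"
      }
    ]
  }'&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;The access policy only decides whether a request is admitted. What an admitted identity can do inside the domain is still governed by fine-grained access control.&lt;/p&gt; 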
&lt;p&gt;&lt;strong&gt;Encryption:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Amazon OpenSearch Service encrypts data both in transit and at rest. When you turn on fine-grained access control, encryption is required—the corresponding settings are automatically turned on and can’t be changed. These include &lt;a href="https://en.wikipedia.org/wiki/Transport_Layer_Security" target="_blank" rel="noopener"&gt;Transport Layer Security (TLS 1.2 or later)&lt;/a&gt; for requests to the domain and for traffic between nodes in the domain, and encryption of data at rest through AWS KMS.&lt;/p&gt; 
&lt;p&gt;For encryption at rest, OpenSearch Service supports three key types: &lt;strong&gt;AWS owned keys&lt;/strong&gt;, &lt;strong&gt;AWS managed keys&lt;/strong&gt;, and &lt;strong&gt;customer managed keys&lt;/strong&gt;. While AWS owned keys provide a quick-start option with no additional configuration, &lt;strong&gt;customer managed keys are the recommended best practice&lt;/strong&gt;. Customer managed keys give you full control over the encryption key lifecycle, including key rotation policies, granular access control through key policies, and the ability to audit key usage through AWS CloudTrail. To use a customer managed key, create a symmetric encryption key in AWS KMS and select it when configuring your domain’s encryption settings.&lt;/p&gt; 
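&lt;p&gt;A customer managed key can be prepared ahead of domain creation. The following is a minimal sketch using the AWS CLI. The description text is arbitrary, and the key policy is left at its default, which you would normally tighten for production.&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;# Create a symmetric encryption key (the default key spec) and capture its ID.
KEY_ID=$(aws kms create-key \
  --description "OpenSearch Service encryption at rest" \
  --query 'KeyMetadata.KeyId' --output text)

# Turn on automatic key rotation for the new key.
aws kms enable-key-rotation --key-id "$KEY_ID"

echo "Select key ${KEY_ID} in the domain encryption settings"&lt;/code&gt;&lt;/pre&gt; 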
&lt;p&gt;With these configurations, your Amazon OpenSearch Service domain and OpenSearch Dashboards are accessible only from within the chosen VPC. For your production scenario, you can follow your organization’s approved mechanism to access the resources in a VPC. To validate fine-grained access control, sign in to OpenSearch Dashboards as the primary user, create a limited-access role, and map it to the IAM role with limited access.&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, we looked at the important security configurations for a public and a VPC-based Amazon OpenSearch domain. You can examine more settings for fine-grained access control in the OpenSearch Dashboards Security section.&lt;/p&gt; 
&lt;p&gt;If you have feedback about this post, submit comments in the Comments section. If you have questions about this post, start a new thread on the &lt;a href="https://forums.aws.amazon.com/forum.jspa?forumID=200" target="_blank" rel="noopener"&gt;Amazon OpenSearch Service forum&lt;/a&gt; or &lt;a href="https://console.aws.amazon.com/support/home" target="_blank" rel="noopener"&gt;contact AWS Support&lt;/a&gt;.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the author&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/06/01/bdb5059a1.jpg" alt="Imtiaz (Taz) Sayed" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Imtiaz (Taz) Sayed&lt;/h3&gt; 
  &lt;p&gt;Imtiaz (Taz) Sayed is the WW Tech Leader for Analytics at AWS. He enjoys engaging with the community on all things data and analytics. He can be reached through &lt;a href="https://www.linkedin.com/in/contacttaz/" target="_blank" rel="noopener"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/01/07/Naren-5-e1767775578161-100x129.jpg" alt="Narendra Gupta" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Narendra Gupta&lt;/h3&gt; 
  &lt;p&gt;Narendra is a Specialist Solutions Architect at AWS, helping customers on their cloud journey with a focus on AWS analytics services. Outside of work, Narendra enjoys learning new technologies, watching movies, and visiting new places.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>Capture data lineage of Amazon EMR spark jobs into Amazon SageMaker Unified Studio</title>
		<link>https://aws.amazon.com/blogs/big-data/capture-data-lineage-of-amazon-emr-spark-jobs-into-amazon-sagemaker-unified-studio/</link>
					
		
		<dc:creator><![CDATA[Jose Romero]]></dc:creator>
		<pubDate>Mon, 01 Jun 2026 18:37:47 +0000</pubDate>
				<category><![CDATA[Amazon EMR]]></category>
		<category><![CDATA[Amazon SageMaker Unified Studio]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">acc0e74153a7fedc9e7238f262f6f4afa6b452e0</guid>

					<description>In this post, you’ll walk through a practical, step-by-step example that shows how to capture and track data lineage from Spark jobs running on Amazon EMR directly into Amazon SageMaker Catalog using OpenLineage. You’ll see how lineage metadata flows automatically and explore data relationships and dependencies across your workflows in Amazon SageMaker Unified Studio.</description>
										<content:encoded>&lt;p&gt;Data engineers running Apache Spark jobs on &lt;a href="https://aws.amazon.com/emr/" target="_blank" rel="noopener noreferrer"&gt;Amazon EMR&lt;/a&gt; face a persistent challenge: understanding how data moves through Spark pipelines as it’s transformed, joined, and written to downstream tables. Tracking these transformations manually requires examining job logs, reviewing code, and piecing together transformation logic across multiple sources. As pipelines scale, this process becomes complex. The visibility gap affects key business activities: troubleshooting data quality issues takes longer, impact analysis for schema changes requires more effort, and compliance audits need extensive documentation of data provenance.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://aws.amazon.com/sagemaker/unified-studio/" target="_blank" rel="noopener noreferrer"&gt;Amazon SageMaker &lt;/a&gt;is the center for all your data and analytics where you can find and access all the data in your organization and act on it using tools across various use case. This unified platform addresses the data visibility challenge by bringing together data governance, collaboration, and discovery into a single interface. At the heart of this platform is &lt;a href="https://aws.amazon.com/sagemaker/catalog/" target="_blank" rel="noopener noreferrer"&gt;Amazon SageMaker Catalog&lt;/a&gt;, a centralized hub that enables organizations to catalog, govern, and discover all their data assets with complete visibility into lineage. By capturing data lineage across your entire data ecosystem from raw sources through transformations to final outputs, SageMaker Catalog enables you to track data provenance across your entire platform, enable collaboration with clear visibility into data ownership and quality metrics, build trust through comprehensive data lineage that supports compliance and confident decision-making, and accelerate discovery of trustworthy, governance-ready data assets. You can access and visualize this lineage directly in Amazon SageMaker Unified Studio, which serves as the unified interface to explore data relationships and collaborate across your analytics workflows.&lt;/p&gt; 
&lt;p&gt;Amazon EMR, starting from version 7.11, now includes native &lt;a href="https://openlineage.io/" target="_blank" rel="noopener noreferrer"&gt;OpenLineage&lt;/a&gt; support that automates lineage capture. OpenLineage is an open-source framework for data lineage that automatically emits lineage metadata from your data transformation jobs directly into &lt;a href="https://aws.amazon.com/sagemaker/catalog/" target="_blank" rel="noopener noreferrer"&gt;Amazon SageMaker Catalog&lt;/a&gt;, or other data governance solutions, without requiring customizations.&lt;/p&gt; 
&lt;p&gt;This EMR native support of OpenLineage is part of a growing set of integrations across AWS analytics services including &lt;a href="https://aws.amazon.com/glue/" target="_blank" rel="noopener noreferrer"&gt;AWS Glue&lt;/a&gt;, &lt;a href="https://aws.amazon.com/emr/serverless/" target="_blank" rel="noopener noreferrer"&gt;Amazon EMR Serverless&lt;/a&gt;, and &lt;a href="https://aws.amazon.com/redshift/" target="_blank" rel="noopener noreferrer"&gt;Amazon Redshift&lt;/a&gt;. The complete list of services with native OpenLineage integration can be found in the &lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/datazone-support-matrix.html" target="_blank" rel="noopener noreferrer"&gt;data lineage support matrix&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;In this post, you’ll walk through a practical, step-by-step example that shows how to capture and track data lineage from Spark jobs running on Amazon EMR directly into Amazon SageMaker Catalog using OpenLineage. You’ll see how lineage metadata flows automatically and explore data relationships and dependencies across your workflows in Amazon SageMaker Unified Studio.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Solution overview&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Imagine you’re part of a large enterprise that relies on HR analytics to optimize workforce planning, compensation strategies, and talent retention practices. Your data engineering team owns the delivery of these analytical products by processing raw HR datasets (including employee records, attendance logs, and compensation details), with Spark jobs running on your Amazon EMR infrastructure.&lt;/p&gt; 
&lt;p&gt;With time, Spark jobs have grown in complexity. Your team now struggles to maintain visibility into how data moves through pipelines, who modified it, and how to map dependencies between datasets and final analytical products.&lt;/p&gt; 
&lt;p&gt;The following solution demonstrates how you can address these challenges by automatically capturing data lineage end-to-end from Spark jobs running on your EMR infrastructure and visualizing it in Amazon SageMaker Unified Studio so that you and the business understand data provenance of the final analytical products.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90380" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/17/bdb-5697-01.png" alt="AWS cloud data pipeline architecture diagram showing data flowing from Amazon S3 CSV files (employees.csv, attendance.csv) through Amazon EMR with Apache Spark processing, AWS Glue Data Catalog metadata management, and Amazon SageMaker Catalog integration, producing salary_adjustments.csv and bonus_payments.csv output files stored in Amazon S3." width="800" height="555"&gt;&lt;/p&gt; 
&lt;p&gt;The architecture includes a &lt;strong&gt;Data Layer&lt;/strong&gt; with CSV files containing employee, attendance, salary, and bonus data stored in &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3 (Simple Storage Service)&lt;/a&gt;, representing typical HR and payroll source systems.&lt;/p&gt; 
&lt;p&gt;The &lt;strong&gt;Processing Layer&lt;/strong&gt; uses an Amazon EMR cluster running Apache Spark jobs that transform raw data into analytical tables. The first Spark job joins employee and attendance data while the second Spark job combines attendance with compensation data. Both jobs use the &lt;a href="https://iceberg.apache.org/" target="_blank" rel="noopener noreferrer"&gt;Apache Iceberg&lt;/a&gt; table format to provide ACID (atomicity, consistency, isolation, and durability) transactions and time travel capabilities.&lt;/p&gt; 
&lt;p&gt;The &lt;strong&gt;Metadata Layer&lt;/strong&gt; uses &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html" target="_blank" rel="noopener noreferrer"&gt;AWS Glue Data Catalog&lt;/a&gt; to store Iceberg table metadata, making tables discoverable and accessible across AWS analytics services. A &lt;strong&gt;Lineage Layer&lt;/strong&gt; uses the OpenLineage integration in EMR to automatically track input/output datasets (CSV files and Iceberg tables), transformation logic at column level (joins, filters, aggregations), and job execution metadata.&lt;/p&gt; 
&lt;p&gt;Finally, the &lt;strong&gt;Data Governance Layer&lt;/strong&gt; uses Amazon SageMaker Catalog to capture and process OpenLineage events posted by the EMR Spark jobs and automatically build a comprehensive lineage graph that shows complete data provenance from CSV source files through Spark transformations to Iceberg analytical tables.&lt;/p&gt; 
&lt;p&gt;Before you deploy this solution, make sure you have the following resources in place.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;For this walkthrough, you should have the following prerequisites:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;An AWS account.&lt;/li&gt; 
 &lt;li&gt;Your assumed role should have full access to Amazon EMR, Amazon S3, &lt;a href="https://aws.amazon.com/iam" target="_blank" rel="noopener noreferrer"&gt;AWS Identity and Access Management (IAM)&lt;/a&gt;, and &lt;a href="https://aws.amazon.com/pm/lambda" target="_blank" rel="noopener noreferrer"&gt;AWS Lambda&lt;/a&gt;. Note that for production workloads, minimum permissions are recommended.&lt;/li&gt; 
 &lt;li&gt;An &lt;a href="https://aws.amazon.com/vpc/" target="_blank" rel="noopener noreferrer"&gt;Amazon VPC (Virtual Private Cloud)&lt;/a&gt; with at least one subnet with internet access. You can provision this VPC as you create the Amazon SageMaker domain next.&lt;/li&gt; 
 &lt;li&gt;An existing Amazon SageMaker Unified Studio domain and project. To get started, use the quick setup option as explained &lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/adminguide/create-domain-sagemaker-unified-studio-quick.html" target="_blank" rel="noopener noreferrer"&gt;here&lt;/a&gt;. To create a project, follow the instructions &lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/getting-started-create-a-project.html" target="_blank" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;An S3 bucket with the sample data files and Spark scripts uploaded (see Prepare your source data below).&lt;/li&gt; 
 &lt;li&gt;Default EMR service roles. If this is your first time using EMR in this account, run &lt;code&gt;aws emr create-default-roles&lt;/code&gt; from the AWS CLI or CloudShell to create them.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;With these prerequisites in place, let’s examine what the AWS CloudFormation template will deploy to your AWS environment.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Architecture components&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;The deployment creates several interconnected components that work together to capture and visualize lineage:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;An S3 bucket to store all data and artifacts for the solution.&lt;/li&gt; 
 &lt;li&gt;An EMR cluster (release 7.12.0) with &lt;a href="https://iceberg.apache.org/" target="_blank" rel="noopener noreferrer"&gt;Apache Iceberg&lt;/a&gt; support enabled and OpenLineage integration pre-installed, ready to run Spark jobs with lineage tracking.&lt;/li&gt; 
 &lt;li&gt;A set of IAM policies that grant the necessary permissions to the EMR cluster to post lineage events to your SageMaker Unified Studio domain.&lt;/li&gt; 
 &lt;li&gt;A set of &lt;a href="https://aws.amazon.com/lake-formation/" target="_blank" rel="noopener noreferrer"&gt;AWS Lake Formation&lt;/a&gt; permissions that grant the EMR cluster to create, alter, and drop Iceberg tables in your specified Glue database.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;With an understanding of what will be deployed, you’re ready to launch the CloudFormation stack.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Deploy the solution&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; While this walkthrough uses the Amazon EMR console and AWS CLI to verify the cluster and run Spark jobs, you can also perform these steps directly from Amazon SageMaker Unified Studio, which provides a unified interface to create and manage EMR clusters, submit Spark jobs, and monitor execution, all within the same environment where you’ll later explore the lineage captured in Amazon SageMaker Catalog.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Prepare your source data&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Before deploying the CloudFormation stack, clone or download the &lt;a href="https://github.com/aws-samples/sample-capture-data-lineage-of-amazon-emr-ec2" target="_blank" rel="noopener"&gt;Git repository&lt;/a&gt;.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;git clone https://github.com/aws-samples/sample-capture-data-lineage-of-amazon-emr-ec2&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Upload the CSV files from the cloned repository to the &lt;code&gt;input/&lt;/code&gt; prefix of your S3 bucket and the Spark scripts to the &lt;code&gt;scripts/&lt;/code&gt; prefix. You can run the following commands to upload the files:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;aws s3 cp employees.csv s3://YOUR-BUCKET/input/
aws s3 cp attendance.csv s3://YOUR-BUCKET/input/
aws s3 cp salary_adjustments.csv s3://YOUR-BUCKET/input/
aws s3 cp bonus_payments.csv s3://YOUR-BUCKET/input/
aws s3 cp emr-lineage-spark-job.py s3://YOUR-BUCKET/scripts/
aws s3 cp emr-lineage-compensation-job.py s3://YOUR-BUCKET/scripts/&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;To deploy the solution, complete the following steps in CloudFormation console:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Create a new stack by specifying the CloudFormation YAML file that you previously downloaded from the Git repository (&lt;code&gt;PutHereThe YMLFileName&lt;/code&gt;).&lt;/li&gt; 
 &lt;li&gt;Enter a stack name (such as &lt;code&gt;emr-lineage-demo&lt;/code&gt;) and provide the following parameters: 
  &lt;ul&gt; 
   &lt;li&gt;&lt;strong&gt;SourceS3BucketName: &lt;/strong&gt;S3 bucket containing your CSV files and Spark scripts&lt;/li&gt; 
   &lt;li&gt;&lt;strong&gt;SourceCSVPrefix: &lt;/strong&gt;S3 prefix where CSV files are located&lt;/li&gt; 
   &lt;li&gt;&lt;strong&gt;SourceScriptsPrefix: &lt;/strong&gt;S3 prefix where Spark scripts are located&lt;/li&gt; 
   &lt;li&gt;&lt;strong&gt;GlueDatabaseName: &lt;/strong&gt;The name of the Glue database associated with your Amazon SageMaker Unified Studio project.&lt;/li&gt; 
   &lt;li&gt;&lt;strong&gt;DataZoneDomainId&lt;/strong&gt;: Your SageMaker Unified Studio domain ID.&lt;/li&gt; 
   &lt;li&gt;&lt;strong&gt;VpcId&lt;/strong&gt;: The ID of the VPC that was deployed as part of the prerequisites.&lt;/li&gt; 
   &lt;li&gt;For &lt;strong&gt;EMRReleaseLabel&lt;/strong&gt;, &lt;strong&gt;MasterInstanceType&lt;/strong&gt;, &lt;strong&gt;CoreInstanceType&lt;/strong&gt; and &lt;strong&gt;CoreInstanceCount&lt;/strong&gt;, keep the default values.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;Acknowledge IAM resource creation, choose &lt;strong&gt;Next&lt;/strong&gt;, and then choose &lt;strong&gt;Submit&lt;/strong&gt;. The CloudFormation stack takes approximately 10 to 15 minutes to complete.&lt;/li&gt; 
 &lt;li&gt;In the &lt;a href="https://console.aws.amazon.com/emr" target="_blank" rel="noopener noreferrer"&gt;EMR console&lt;/a&gt;, wait for the cluster status to show as &lt;strong&gt;WAITING&lt;/strong&gt; before moving to the next step.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90379" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/17/bdb-5697-02.png" alt="Screenshot of the Amazon EMR on EC2 Clusters management console showing a list of 14 clusters, with the cluster &amp;quot;EMR-Lineage-Demo-emr-ec2-lineage-demo-stack&amp;quot; (ID: j-3APWOTUDNYO2T) highlighted in a &amp;quot;Waiting – Ready to run steps&amp;quot; status with a green badge." width="1058" height="202"&gt;&lt;/p&gt; 
&lt;p&gt;Now that the EMR cluster is running with OpenLineage enabled, let’s examine how the Spark jobs are configured to capture lineage metadata.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Explore data lineage configuration in EMR&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;When submitting Spark jobs to EMR, specific configurations enable OpenLineage to create and post lineage events to SageMaker Unified Studio as the job runs:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;code&gt;spark.hadoop.hive.metastore.client.factory.class&lt;/code&gt; – Configures Spark to use AWS Glue as the Hive metastore.&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;spark.jars&lt;/code&gt; – Path to the pre-installed OpenLineage library (available on EMR 7.11+).&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;spark.extraListeners&lt;/code&gt; – Registers an OpenLineage listener to capture metadata of input / output datasets and transformations.&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;spark.openlineage.transport.type&lt;/code&gt; – Uses the OpenLineage DataZone transport option to send lineage events directly into SageMaker Catalog.&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;spark.openlineage.transport.domainId&lt;/code&gt; – The ID of your SageMaker Unified Studio domain, which serves as the target for lineage events.&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;spark.glue.accountId&lt;/code&gt; – Your AWS account ID for Glue data catalog operations.&lt;/li&gt; 
&lt;/ul&gt; 
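&lt;p&gt;To make the list above concrete, here is a sketch of how these settings might appear on a &lt;code&gt;spark-submit&lt;/code&gt; command line. The property names are the ones listed above, but every value is an assumption based on our reading of the AWS documentation (including the factory class, JAR path, listener class, and transport type), and the Iceberg catalog settings are left out. Treat the submit command in your CloudFormation stack outputs as the authoritative version.&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;# All values below are assumptions for illustration; compare with your stack's submit command.
DOMAIN_ID="dzd_exampledomainid"   # placeholder SageMaker Unified Studio domain ID
ACCOUNT_ID="111122223333"         # placeholder AWS account ID

spark-submit \
  --conf spark.hadoop.hive.metastore.client.factory.class=com.amazonaws.glue.catalog.metastore.AWSGlueDataCatalogHiveClientFactory \
  --conf spark.jars=/usr/share/aws/datazone-openlineage-spark/lib/DataZoneOpenLineageSpark-1.0.jar \
  --conf spark.extraListeners=io.openlineage.spark.agent.OpenLineageSparkListener \
  --conf spark.openlineage.transport.type=amazon_datazone_api \
  --conf spark.openlineage.transport.domainId="${DOMAIN_ID}" \
  --conf spark.glue.accountId="${ACCOUNT_ID}" \
  s3://YOUR-BUCKET/scripts/emr-lineage-spark-job.py&lt;/code&gt;&lt;/pre&gt; 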
&lt;p&gt;Now that you understand the configuration that enables automatic lineage capture, you’re ready to run the data pipeline.&lt;/p&gt; 
&lt;p&gt;When running this two-step pipeline, you will calculate the total employee compensation by combining salary adjustments, bonuses, and attendance data. The final analytical asset will serve payroll processing and budgeting.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Run employee attendance analysis job&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;The first job reads employee details (in &lt;code&gt;employees.csv&lt;/code&gt; dataset) and attendance records (in &lt;code&gt;attendance.csv&lt;/code&gt; dataset), joins the datasets on &lt;code&gt;EmployeeID&lt;/code&gt; and creates a unified dataset (&lt;code&gt;employee_attendance&lt;/code&gt; Iceberg table) in your Glue database.&lt;/p&gt; 
&lt;p&gt;Follow the steps below to run this first job:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;In the &lt;a href="https://console.aws.amazon.com/cloudformation" target="_blank" rel="noopener noreferrer"&gt;CloudFormation console&lt;/a&gt;, navigate to the stack’s &lt;strong&gt;Outputs&lt;/strong&gt; tab&lt;/li&gt; 
 &lt;li&gt;Copy the value of the &lt;code&gt;Job1SubmitCommand&lt;/code&gt; output key. Note that this is the command you’ll use to submit the first job in EMR with the right configuration.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90378" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/17/bdb-5697-03.png" alt="AWS CloudFormation console screenshot showing the Outputs tab for the &amp;quot;emr-ec2-lineage-demo-stack&amp;quot; stack, displaying 9 outputs including the Job1SubmitCommand — an AWS EMR add-steps command with Apache Spark configuration for the EMR Lineage Demo Job targeting cluster j-3APWOTUDNYO2T." width="1496" height="684"&gt;&lt;/p&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;Run the command in your terminal or AWS CloudShell.&lt;/li&gt; 
 &lt;li&gt;Monitor the job in the Amazon EMR console under &lt;strong&gt;Steps&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90377" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/17/bdb-5697-04.png" alt="Screenshot of the Amazon EMR console Steps tab for the cluster &amp;quot;EMR-Lineage-Demo-emr-ec2-lineage-demo-stack,&amp;quot; showing one completed step named &amp;quot;EMR-Lineage-Demo-Job&amp;quot; with Step ID s-0270631D8DHBCJZKBAZ and a green &amp;quot;Completed&amp;quot; status checkmark." width="1143" height="441"&gt;&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Run employee compensation analysis job&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;The second job calculates total employee compensation, written to the &lt;code&gt;employee_compensation&lt;/code&gt; Iceberg table, by combining salary adjustments (&lt;code&gt;salary_adjustments.csv&lt;/code&gt; dataset), bonuses (&lt;code&gt;bonus_payments.csv&lt;/code&gt; dataset), and the attendance data produced by the first job:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Repeat steps 1 to 4 from the previous section to run the second job.&lt;/li&gt; 
 &lt;li&gt;After completion, open the &lt;a href="https://console.aws.amazon.com/glue" target="_blank" rel="noopener noreferrer"&gt;AWS Glue console&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;Navigate to &lt;strong&gt;Data Catalog&lt;/strong&gt;, then &lt;strong&gt;Tables&lt;/strong&gt;, and select your SageMaker project’s database.&lt;/li&gt; 
 &lt;li&gt;Confirm that &lt;code&gt;employee_attendance&lt;/code&gt; and &lt;code&gt;employee_compensation&lt;/code&gt; tables are listed.&lt;/li&gt; 
&lt;/ol&gt; 
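&lt;p&gt;The same check can be sketched with the AWS CLI. The database name is a placeholder for your project’s Glue database, and the calling identity needs permission to read that database.&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;# Placeholder database name; use the Glue database of your SageMaker project.
aws glue get-tables --database-name YOUR-GLUE-DATABASE \
  --query 'TableList[].Name' --output text&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Both &lt;code&gt;employee_attendance&lt;/code&gt; and &lt;code&gt;employee_compensation&lt;/code&gt; should appear in the returned names once the two jobs have completed.&lt;/p&gt; 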
&lt;p&gt;With both Spark jobs complete, you can now visualize the complete data lineage graph in Amazon SageMaker Unified Studio.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Visualizing lineage in SageMaker Unified Studio&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;SageMaker Unified Studio provides a graph-based data lineage visualization that helps data engineers, analysts, and data scientists clearly understand which source datasets (files or tables) feed into each dataset, what transformations and logic are applied at every step, which downstream analytics assets consume the data, and how changes to upstream data or transformations may impact the rest of the data pipeline.&lt;/p&gt; 
&lt;p&gt;Now that the data pipeline has run successfully, let’s review the captured lineage for the HR data in SageMaker Unified Studio:&lt;/p&gt; 
&lt;ol start="1"&gt; 
 &lt;li&gt;Navigate to the &lt;a href="https://console.aws.amazon.com/datazone" target="_blank" rel="noopener noreferrer"&gt;SageMaker Unified Studio console&lt;/a&gt;, sign in to your domain.&lt;/li&gt; 
 &lt;li&gt;Open your project and go to &lt;strong&gt;Data Sources&lt;/strong&gt;&lt;/li&gt; 
 &lt;li&gt;Find your AWS Glue Data Catalog source&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90376" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/17/bdb-5697-05.png" alt="Screenshot of the Amazon SageMaker project catalog Data Sources page listing three configured data sources: a Redshift Serverless source, an AWS Glue Lakehouse source named &amp;quot;AwsDataCatalog-emr_ec2_lineage_blogpost_glue_db-default-datasource&amp;quot; (highlighted), and a Tooling SageMaker model package group source — all scheduled MTWTFSS and in Ready or Running status." width="1491" height="447"&gt;&lt;/p&gt; 
&lt;ol start="4"&gt; 
 &lt;li&gt;Click &lt;strong&gt;RUN&lt;/strong&gt;. Two new assets will be created.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90375" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/17/bdb-5697-06.png" alt="Screenshot of the AWS Glue Data Catalog interface showing run activities for the data source &amp;quot;AwsDataCatalog-emr_ec2_lineage_blogpost_glue_db-default-datasource,&amp;quot; with two completed on-demand runs and a highlighted asset table showing employee_attendance and employee_compensation successfully created in the emr_ec2_lineage_blogpost_glue_db database." width="1879" height="560"&gt;&lt;/p&gt; 
&lt;ol start="5"&gt; 
 &lt;li&gt;Navigate to &lt;strong&gt;Assets&lt;/strong&gt; and choose &lt;em&gt;employee_compensation&lt;/em&gt;. On the &lt;strong&gt;LINEAGE&lt;/strong&gt; tab, you’ll find the lineage graph that SageMaker builds from the OpenLineage metadata captured from the EMR Spark jobs as they run.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90374" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/17/bdb-5697-07.png" alt="AWS Glue data lineage visualization showing the flow of the employee_compensation dataset from an Apache Spark job (default.emr_lineage_compensa, COMPLETE, Dec 22 2025 11:42:47 AM) through an AWS Glue Iceberg table (20 columns) to an AWS Glue Inventory destination table, with a right sidebar displaying lineage metadata including the dataset ARN, OpenLineage producer URL, Iceberg snapshot ID, and projected field names EmployeeID, Name, and Department." width="1856" height="730"&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li style="list-style-type: none"&gt; 
  &lt;ul&gt; 
   &lt;li&gt;You’ll first see three lineage nodes from left to right: one representing the EMR Spark job that created the final Iceberg table, a second one representing the actual Iceberg table in the Glue catalog, and a third one representing the data asset in the SageMaker Catalog inventory that maps to the Glue table.&lt;/li&gt; 
   &lt;li&gt;Click on any lineage node to view its underlying metadata in the details pane, including dataset names, S3 locations, schema, data types, job execution details and more.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ul&gt; 
&lt;ol start="6"&gt; 
 &lt;li&gt;Expand the lineage to the left by clicking on the double arrow next to the first lineage node. Keep expanding until you hit the originating datasets.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90373" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/17/bdb-5697-08.png" alt="Data pipeline lineage diagram showing the complete ETL flow from Amazon S3 source files (input/attendance.csv with 6 columns, input/employees.csv with 5 columns) through two Apache Spark jobs to intermediate tables (input/salary_adjustments.csv, iceberg/employee.csv, AWS Glue employee_attendance with 14 columns) and final destination tables (AWS Glue iceberg/employee_compensation with 29 columns, AWS Glue Inventory employee_compensation_hive with 30 columns), all timestamped Dec 22, 2025." width="1851" height="449"&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li style="list-style-type: none"&gt; 
  &lt;ul&gt; 
   &lt;li&gt;Expanding the graph to the left reveals the complete data pipeline back to original CSV source files. You can see how compensation data depends on upstream attendance analytics.&lt;/li&gt; 
   &lt;li&gt;Note how each lineage node represents an element in the data pipeline you ran, including both Spark jobs and even the intermediate &lt;em&gt;employee_attendance&lt;/em&gt; Iceberg table that connects them.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ul&gt; 
&lt;ol start="7"&gt; 
 &lt;li&gt;You can expand column-level lineage by choosing the column section of a dataset or data asset lineage node. This shows how data changes at the column level as it moves downstream through your data pipeline.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90372" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/17/bdb-5697-09.png" alt="Data lineage diagram showing the employee compensation ETL pipeline with four Amazon S3 source tables (employee.csv with 5 columns, input/attendance.csv with 6 columns, input/salary_adjustments.csv with 4 columns, output/employee_attendance.csv with 14 columns) processed by two Apache Spark jobs to produce a final s3://employee_compensation table with 20 columns, all dated Dec 22, 2025." width="1794" height="813"&gt;&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Cleanup&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;To avoid ongoing charges, clean up the resources:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;First, empty the destination bucket by running the following command in your terminal or with AWS CloudShell.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p style="padding-left: 40px"&gt;&lt;code&gt;aws s3 rm s3://${DEST_BUCKET}/ --recursive&lt;/code&gt;&lt;/p&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;Delete the CloudFormation stack. 
  &lt;ul&gt; 
   &lt;li&gt;On the &lt;a href="https://console.aws.amazon.com/cloudformation" target="_blank" rel="noopener noreferrer"&gt;AWS CloudFormation console&lt;/a&gt;, choose &lt;strong&gt;Stacks&lt;/strong&gt; in the navigation pane.&lt;/li&gt; 
   &lt;li&gt;Choose the stack you created, choose &lt;strong&gt;Delete&lt;/strong&gt;, and then choose &lt;strong&gt;Delete stack&lt;/strong&gt; when prompted.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;In this post, you explored how to capture data lineage from Spark jobs in Amazon EMR (v7.11+) directly into Amazon SageMaker Unified Studio. You learned how to set up an Amazon EMR cluster with native OpenLineage support to automatically track lineage metadata from Spark jobs processing your data. You also configured the integration between EMR and Amazon SageMaker Catalog so that lineage information flows into your governance platform. Finally, you explored the resulting lineage graph in SageMaker Unified Studio and saw how it provides visibility into data transformations, from source CSV files through Spark processing jobs to final analytical tables in Apache Iceberg format.&lt;/p&gt; 
&lt;p&gt;We encourage you to now test these capabilities with your own data pipelines running on EMR. By implementing automated lineage tracking, many customers have strengthened their governance frameworks while gaining valuable insights into data dependencies, impact analysis, and compliance requirements. This approach enables data teams to build trust in their analytics outputs while maintaining the agility needed to derive business value from their data assets.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="wp-image-90370 size-thumbnail alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/17/bdb-5697-11-100x116.jpeg" alt="" width="100" height="116"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Yanick Houngbedji&lt;/strong&gt; is a Solutions Architect for Independent Software Vendors (ISV) at Amazon Web Services (AWS), based in Montréal, Canada. He specializes in helping customers architect and implement highly scalable, performant, and secure cloud solutions on AWS. Before joining AWS, he spent over 8 years providing technical leadership in data engineering, big data analytics, business intelligence, and data science solutions.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-90371 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/17/bdb-5697-10.jpeg" alt="" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Jose Romero&lt;/strong&gt; is a Senior Solutions Architect for Startups at Amazon Web Services (AWS) based in Austin, TX, US. He is passionate about helping customers architect modern platforms at scale for data, AI, and ML. As a former senior architect in AWS Professional Services, he enjoys building and sharing solutions for common complex problems so that customers can accelerate their cloud journey and adopt best practices. Connect with him on &lt;a href="https://www.linkedin.com/in/joserom/"&gt;LinkedIn&lt;/a&gt;.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>The next generation of Amazon OpenSearch Serverless: Built from the ground up for agents</title>
		<link>https://aws.amazon.com/blogs/big-data/the-next-generation-of-amazon-opensearch-serverless-built-from-the-ground-up-for-agents/</link>
		
		<dc:creator><![CDATA[Sohaib Katariwala]]></dc:creator>
		<pubDate>Thu, 28 May 2026 18:24:54 +0000</pubDate>
				<category><![CDATA[Amazon OpenSearch Service]]></category>
		<category><![CDATA[Announcements]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<category><![CDATA[Serverless]]></category>
		<guid isPermaLink="false">d884ba3fc13604eba98a5b3a53a9c668462c60d9</guid>

					<description>Today, we are announcing a ground-up re-architecture of Amazon OpenSearch Serverless that delivers up to 20 times faster autoscaling, scale to zero, and up to 60% lower cost than provisioning clusters for peak load. Amazon OpenSearch Service is a fully managed, open source retrieval engine that unifies vector, lexical, hybrid, and agentic search, delivering low-latency, accurate and relevant results. Amazon OpenSearch Serverless is an automatically scaled deployment option. The new architecture decouples compute from storage. The service provisions infrastructure in seconds instead of minutes, and scales compute all the way to zero when your application is idle. In this post, we walk through the new architecture, what it means for your applications, and how to get started with a hands-on tutorial.</description>
										<content:encoded>&lt;p&gt;&lt;em&gt;Audience note: This is the deep-dive technical launch post. For a shorter overview of what changed and why, see the related post on the &lt;a href="https://aws.amazon.com/blogs/aws/introducing-the-next-generation-of-amazon-opensearch-serverless-for-building-your-agentic-ai-applications/" target="_blank" rel="noopener"&gt;AWS News Blog&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;Today, we are announcing a ground-up re-architecture of Amazon OpenSearch Serverless that delivers up to 20 times faster autoscaling, scale to zero, and up to 60% lower cost than provisioning clusters for peak load. Amazon OpenSearch Service is a fully managed, open source retrieval engine that unifies vector, lexical, hybrid, and agentic search, delivering low-latency, accurate and relevant results. Amazon OpenSearch Serverless is an automatically scaled deployment option.&lt;/p&gt; 
&lt;p&gt;Modern workloads are increasingly dynamic and unpredictable. An ecommerce platform sees a 10x traffic spike during a flash sale. An artificial intelligence (AI) agent triggers hundreds of concurrent vector queries while reasoning through a multi-step task, then goes idle. A multi-tenant SaaS application serves dozens of tenants with wildly different activity patterns. These workloads need infrastructure that scales up to meet demand and releases resources when demand drops.&lt;/p&gt; 
&lt;p&gt;That is why we rebuilt the &lt;a href="https://aws.amazon.com/opensearch-service/features/serverless/" target="_blank" rel="noopener"&gt;Amazon OpenSearch Serverless&lt;/a&gt; architecture from the ground up. The new architecture decouples compute from storage. The service provisions infrastructure in seconds instead of minutes, and scales compute all the way to zero when your application is idle. In this post, we walk through the new architecture, what it means for your applications, and how to get started with a hands-on tutorial.&lt;/p&gt; 
&lt;div&gt;
 With this launch, Amazon OpenSearch Serverless introduces two named architectures. Existing collections are now referred to as 
 &lt;strong&gt;Classic&lt;/strong&gt; collections. The new architecture is called 
 &lt;strong&gt;NextGen&lt;/strong&gt; and is now the default when you create a new collection in the AWS Management Console. To use the NextGen architecture from the AWS CLI, specify 
 &lt;code&gt;--generation NEXTGEN&lt;/code&gt;. To continue using the Classic architecture, specify 
 &lt;code&gt;--generation CLASSIC&lt;/code&gt; in the CLI or omit the optional 
 &lt;code&gt;--generation&lt;/code&gt; parameter.
&lt;/div&gt; 
&lt;h2&gt;What this means for your applications&lt;/h2&gt; 
&lt;p&gt;The new architecture delivers improvements across three pillars: performance, cost, and a simplified user experience.&lt;/p&gt; 
&lt;h3&gt;Performance: Autoscaling in seconds&lt;/h3&gt; 
&lt;p&gt;An OpenSearch Compute Unit (OCU) is the unit of compute capacity that powers your indexing and search workloads. Amazon OpenSearch Serverless now provisions additional OCUs in seconds. When traffic arrives, the service adds resources in line with demand instead of reacting after a worker is already under pressure. The same mechanism scales the infrastructure back down quickly when traffic drops. The new architecture scales capacity up to &lt;strong&gt;20 times faster&lt;/strong&gt; than the previous architecture, so your users experience consistent performance during traffic surges, and you stop paying for capacity when you no longer need it.&lt;/p&gt; 
&lt;h3&gt;Cost efficiency: Pay only for what you use&lt;/h3&gt; 
&lt;p&gt;Indexing, search, storage, and Vector Index GPU-Acceleration are metered and billed independently, so you can see and optimize each dimension of your workload separately.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Decoupled compute and storage:&lt;/strong&gt; OpenSearch Serverless now has full decoupling between compute and storage, allowing OCUs to scale up and down irrespective of the amount of data stored in a collection. This is powered by a new storage layer that is accessible to both indexing and search OCUs. You can now have multiple indices with data indexed in them but not pay any compute costs if you are not actively indexing or searching data. For workloads with significant idle time, the new architecture can reduce infrastructure costs by up to 60% compared to the cost of provisioning OpenSearch Service domains for peak capacity.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Scale to zero:&lt;/strong&gt; When no requests arrive within the idle timeout window (10 minutes), the service releases compute resources and your OCU usage scales to 0. When traffic resumes, capacity is back in approximately 10 seconds. During this window, the service queues incoming requests and serves them once capacity is available; it does not drop them. If you anticipate a burst of traffic, for example before a scheduled batch job or a marketing campaign, you can send a lightweight query (such as a &lt;code&gt;match_all&lt;/code&gt; with &lt;code&gt;size=1&lt;/code&gt;) to warm the collection before your application starts sending production traffic. This reduces the latency your users experience on the first real request. Indexing and search scale independently. If you have no search requests, search OCUs scale to zero, even while OpenSearch Serverless maintains indexing OCUs for indexing requests, and vice versa.&lt;/p&gt; 
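&lt;p&gt;As a minimal sketch of that warm-up call, the following wraps the lightweight query in a shell function. It assumes a hypothetical index named &lt;code&gt;items&lt;/code&gt;, an &lt;code&gt;ENDPOINT&lt;/code&gt; variable that holds your collection endpoint, the &lt;code&gt;us-east-2&lt;/code&gt; Region, and the &lt;code&gt;awscurl&lt;/code&gt; tool for SigV4 signing. The function name &lt;code&gt;warm_collection&lt;/code&gt; is illustrative and not part of the service.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# Illustrative helper (not an AWS-provided command): send one lightweight
# match_all query with size=1 to wake a collection that has scaled to zero.
# Assumptions: ENDPOINT holds the collection endpoint, awscurl is installed,
# and "items" is a placeholder index name.
warm_collection() {
    local index="${1:-items}"
    awscurl --service aoss --region us-east-2 \
        -XGET "${ENDPOINT}/${index}/_search" \
        -H "Content-Type: application/json" \
        -d '{"size": 1, "query": {"match_all": {}}}'
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;You might call &lt;code&gt;warm_collection&lt;/code&gt; from a scheduler shortly before a batch job or campaign starts, so that the approximately 10-second resume happens before production requests arrive.&lt;/p&gt; 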
&lt;div&gt;
 &lt;strong&gt;GPU acceleration for vector workloads:&lt;/strong&gt; For vector collections created in the new architecture, OpenSearch Serverless automatically uses GPU-backed compute to accelerate Hierarchical Navigable Small World (HNSW) vector index construction, significantly reducing indexing time compared to CPU-only builds. GPU acceleration kicks in automatically whenever there is an opportunity to leverage GPUs to reduce overall indexing time and cost. In the Classic architecture, you had to opt in or out of GPU acceleration at the collection level through the API. If you want to disable GPU acceleration for NextGen collections for a specific index, you can 
 &lt;a href="https://docs.opensearch.org/latest/vector-search/remote-index-build/" target="_blank" rel="noopener"&gt;turn off the remote index build setting at the index level&lt;/a&gt;. GPU usage appears as a separate line item on your bill, so you have full visibility into when acceleration was active and what it cost. For more details on how GPU acceleration works and performance benchmarks, refer to 
 &lt;a href="https://aws.amazon.com/blogs/big-data/build-billion-scale-vector-databases-in-under-an-hour-with-gpu-acceleration-on-amazon-opensearch-service/" target="_blank" rel="noopener"&gt;Build billion-scale vector databases in under an hour with GPU acceleration on Amazon OpenSearch Service&lt;/a&gt;.
&lt;/div&gt; 
&lt;h3&gt;Simplified experience: Fewer steps to production&lt;/h3&gt; 
&lt;p&gt;We also simplified the day-to-day experience of running OpenSearch Serverless:&lt;/p&gt; 
&lt;p&gt;With the new architecture, you can provision a collection and start sending requests in seconds. There is no need for capacity planning, no sizing decisions, and no waiting for infrastructure to warm up. This makes Amazon OpenSearch Serverless a natural fit for agentic workloads, where an AI agent can spin up a vector search or retrieval step on demand and expect a response without delay.&lt;/p&gt; 
&lt;p&gt;To make getting started even faster, we have introduced Express Create on the console. You supply a collection name and a collection type, choose Express Create, and your collection is active in seconds with no upfront network, encryption, or access policies to configure. You can add those later if your workload requires them.&lt;/p&gt; 
&lt;p&gt;Collection groups and collections can also be created programmatically using the AWS Command Line Interface (AWS CLI) and AWS SDKs. AWS CloudFormation support is coming soon.&lt;/p&gt; 
&lt;p&gt;The new architecture introduces two endpoint formats on the &lt;code&gt;on.aws&lt;/code&gt; domain. The per-collection endpoint (&lt;code&gt;&amp;lt;collectionId&amp;gt;.aoss.&amp;lt;region&amp;gt;.on.aws&lt;/code&gt;) works the same way as before with one endpoint per collection. The per-account Regional endpoint (&lt;code&gt;&amp;lt;accountId&amp;gt;.aoss.&amp;lt;region&amp;gt;.on.aws&lt;/code&gt;) is new: it serves all of your collections through a single hostname, with the target collection identified in each request using the &lt;code&gt;x-amz-aoss-collection-name&lt;/code&gt; or &lt;code&gt;x-amz-aoss-collection-id&lt;/code&gt; header. This means one connection pool, one Transport Layer Security (TLS) session, and one endpoint to manage regardless of how many collections you have — a significant improvement for multi-tenant workloads where each tenant maps to its own collection. Both endpoints use standard AWS PrivateLink, so you create virtual private cloud (VPC) endpoints from the VPC console or the EC2 API just like any other AWS service. Private Domain Name System (DNS) is configured automatically, eliminating the Amazon Route 53 Private Hosted Zones, forwarding rules, and custom DNS infrastructure that were required with the original architecture. Cross-VPC, cross-account, and on-premises access all work using standard &lt;code&gt;vpce-*&lt;/code&gt; DNS names with no additional setup.&lt;/p&gt; 
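&lt;p&gt;The following is a hedged sketch of how a request through the per-account Regional endpoint might look, built only from the hostname format and header name described above. The &lt;code&gt;ACCOUNT_ID&lt;/code&gt; variable, the &lt;code&gt;us-east-2&lt;/code&gt; Region, the collection and index names, the request path, and the use of &lt;code&gt;awscurl&lt;/code&gt; for signing are all assumptions for illustration, so verify the exact request shape against the service documentation before relying on it.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# Illustrative helper: query one collection through the per-account Regional
# endpoint, selecting the target collection with a request header.
# Assumptions: ACCOUNT_ID is set, awscurl is installed, and the index exists.
search_via_regional_endpoint() {
    local collection_name="$1"
    local index="$2"
    awscurl --service aoss --region us-east-2 \
        -XGET "https://${ACCOUNT_ID}.aoss.us-east-2.on.aws/${index}/_search" \
        -H "x-amz-aoss-collection-name: ${collection_name}" \
        -H "Content-Type: application/json" \
        -d '{"size": 1, "query": {"match_all": {}}}'
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;In a multi-tenant setup where each tenant maps to its own collection, only the header value would change between tenants, for example &lt;code&gt;search_via_regional_endpoint tenant-a items&lt;/code&gt; and &lt;code&gt;search_via_regional_endpoint tenant-b items&lt;/code&gt; (hypothetical collection names).&lt;/p&gt; 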
&lt;p&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/amazon-opensearch-serverless-introduces-collection-groups-to-optimize-cost-for-multi-tenant-workloads/" target="_blank" rel="noopener"&gt;Collection groups are the new unit of organization&lt;/a&gt; for your collections. You can share compute capacity across multiple collections with &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-collection-groups.html" target="_blank" rel="noopener"&gt;Collection Groups&lt;/a&gt;, which reduces cost for smaller collections that have complementary traffic patterns. You can also assign different AWS Key Management Service (AWS KMS) keys to collections within the same group, so you get both cost efficiency and per-collection encryption isolation. Collection groups are required when creating collections with the new architecture.&lt;/p&gt; 
&lt;p&gt;You also get the benefits of OpenSearch open-source releases without needing to manage versions and upgrades. The service tracks upstream releases automatically.&lt;/p&gt; 
&lt;p&gt;Amazon OpenSearch Serverless is also available on the &lt;a href="https://vercel.com/marketplace/aws" target="_blank" rel="noopener"&gt;Vercel Marketplace&lt;/a&gt;, making it straightforward for developers to add search infrastructure directly from their Vercel projects. You can link an existing AWS account through delegated access, or get started through a Limited Scope Account with USD $100 in AWS credit if you are new to AWS.&lt;/p&gt; 
&lt;p&gt;The integration creates a collection with sensible defaults, scale-to-zero billing, public endpoints, and AWS-managed encryption, and automatically sets connection details as environment variables in your Vercel project. You can choose from Search or Vector Search collection types depending on your use case, whether that is full-text search or semantic and AI-powered search.&lt;/p&gt; 
&lt;h2&gt;How the architecture works&lt;/h2&gt; 
&lt;p&gt;The new Amazon OpenSearch Serverless architecture separates compute from storage entirely. OCUs are stateless and read from and write to a distributed shared storage layer that is accessible to both indexing and search OCUs. The storage layer is designed for high durability, keeping your data available independently of the compute nodes that process it.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-5937-1.png" alt="Architecture diagram showing OpenSearch Serverless NextGen with stateless indexing and search OCUs reading from and writing to a shared distributed storage layer" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;This design has two practical consequences:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;&lt;strong&gt;Fast provisioning.&lt;/strong&gt; New OCUs start serving requests in seconds because there is no local disk to bootstrap. The OCU mounts the shared storage layer and begins processing immediately.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Efficient scale down.&lt;/strong&gt; Idle capacity can be released with no impact to your stored data, because the data never lived on the OCU. When traffic subsides, compute resources are released and your cost drops accordingly.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h3&gt;Architecture comparison&lt;/h3&gt; 
&lt;p&gt;The following table summarizes the key differences between the original and new architectures:&lt;/p&gt; 
&lt;table border="1px" width="100%" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Capability&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Classic Architecture&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;NextGen Architecture&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Minimum capacity&lt;/td&gt; 
   &lt;td&gt;2 OCUs (always on)&lt;/td&gt; 
   &lt;td&gt;0 OCUs (scale to zero)&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Scaling speed&lt;/td&gt; 
   &lt;td&gt;Minutes&lt;/td&gt; 
   &lt;td&gt;Seconds&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Storage&lt;/td&gt; 
   &lt;td&gt;Local storage per compute node&lt;/td&gt; 
   &lt;td&gt;Distributed shared storage (decoupled)&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Collection organization&lt;/td&gt; 
   &lt;td&gt; &lt;p&gt;Individual collections (Default)&lt;/p&gt; &lt;p&gt;Collection groups (Optional)&lt;/p&gt;&lt;/td&gt; 
   &lt;td&gt;Collection groups (required)&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Cold start from zero&lt;/td&gt; 
   &lt;td&gt;N/A (always on)&lt;/td&gt; 
   &lt;td&gt;~10 seconds&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Endpoint&lt;/td&gt; 
   &lt;td&gt;Per-collection endpoint&lt;/td&gt; 
   &lt;td&gt;Per-collection endpoint, plus a Regional endpoint (static per account)&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Cost vs.&amp;nbsp;OpenSearch Service domain&lt;/td&gt; 
   &lt;td&gt;Baseline&lt;/td&gt; 
   &lt;td&gt;Up to 60% lower cost&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Scaling speed (vs.&amp;nbsp;Classic)&lt;/td&gt; 
   &lt;td&gt;Baseline&lt;/td&gt; 
   &lt;td&gt;Up to 20 times faster than baseline&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;h2&gt;Walkthrough: Create a vector collection and observe scale to zero&lt;/h2&gt; 
&lt;p&gt;In this walkthrough, you create a vector search collection with Express Create, index a few sample documents with embeddings, run a k-nearest neighbor (k-NN) query, and watch the collection scale to zero in Amazon CloudWatch. The entire process takes about 10 minutes.&lt;/p&gt; 
&lt;h3&gt;Prerequisites&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;An AWS account with permissions to create Amazon OpenSearch Serverless collections.&lt;/li&gt; 
 &lt;li&gt;AWS Command Line Interface (AWS CLI) configured with appropriate credentials.&lt;/li&gt; 
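 &lt;li&gt;&lt;code&gt;awscurl&lt;/code&gt; for sending SigV4-signed requests (used in the commands in this walkthrough), or curl 7.75 or later with its built-in &lt;code&gt;--aws-sigv4&lt;/code&gt; option.&lt;/li&gt; 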
&lt;/ul&gt; 
&lt;h3&gt;Step 1: Configure security policies&lt;/h3&gt; 
&lt;p&gt;Create encryption, network, and data access policies. These must exist before the collection can be created.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# Create an encryption policy
aws opensearchserverless create-security-policy \
    --name product-vectors-encryption \
    --type encryption \
    --policy '{"Rules":[{"ResourceType":"collection","Resource":["collection/product-vectors"]}],"AWSOwnedKey":true}' \
    --endpoint-url "https://aoss.us-east-2.amazonaws.com" \
    --region "us-east-2"

# Create a network policy (public access for this tutorial)
aws opensearchserverless create-security-policy \
    --name product-vectors-network \
    --type network \
    --policy '[{"Rules":[{"ResourceType":"collection","Resource":["collection/product-vectors"]},{"ResourceType":"dashboard","Resource":["collection/product-vectors"]}],"AllowFromPublic":true}]' \
    --endpoint-url "https://aoss.us-east-2.amazonaws.com" \
    --region "us-east-2"

# Get your principal ARN
PRINCIPAL_ARN=$(aws sts get-caller-identity --query 'Arn' --output text)

# Create a data access policy
aws opensearchserverless create-access-policy \
    --name product-vectors-data \
    --type data \
    --policy "[{\"Rules\":[{\"ResourceType\":\"index\",\"Resource\":[\"index/product-vectors/*\"],\"Permission\":[\"aoss:CreateIndex\",\"aoss:DescribeIndex\",\"aoss:UpdateIndex\",\"aoss:DeleteIndex\",\"aoss:ReadDocument\",\"aoss:WriteDocument\"]}],\"Principal\":[\"${PRINCIPAL_ARN}\"]}]" \
    --endpoint-url "https://aoss.us-east-2.amazonaws.com" \
    --region "us-east-2"&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;blockquote&gt;
 &lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If you use the AWS console’s Express Create workflow, these policies are created automatically.&lt;/p&gt; 
 &lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; After creating the data access policy, wait approximately 30 to 60 seconds for the policy to propagate before making API calls to the collection. If you receive a 403 Forbidden error, wait and retry.&lt;/p&gt;
&lt;/blockquote&gt; 
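&lt;p&gt;The wait-and-retry advice above can be sketched as a small shell helper. This is an illustrative function, not part of the AWS CLI, and it only helps if the wrapped command exits with a non-zero status on failure, so check how your HTTP client reports a 403 response. The attempt count and wait time are arbitrary values that you pass in.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# Illustrative helper: rerun a command until it succeeds, pausing between tries.
# Usage: retry_with_wait MAX_ATTEMPTS WAIT_SECONDS command [args...]
# Returns 0 on the first success, or 1 after MAX_ATTEMPTS failed attempts.
retry_with_wait() {
    local max_attempts="$1"
    local wait_seconds="$2"
    shift 2
    local attempt=1
    until "$@"; do
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "Giving up after ${attempt} attempts"
            return 1
        fi
        echo "Attempt ${attempt} failed; retrying in ${wait_seconds}s"
        attempt=$((attempt + 1))
        sleep "$wait_seconds"
    done
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;For example, &lt;code&gt;retry_with_wait 5 15 my_request&lt;/code&gt; would try a hypothetical &lt;code&gt;my_request&lt;/code&gt; command up to five times with 15 seconds between attempts, which covers the 30 to 60 second propagation window.&lt;/p&gt; 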
&lt;h3&gt;Step 2: Create a collection group and collection&lt;/h3&gt; 
&lt;p&gt;Create a collection group with scale-to-zero capacity limits, then create a vector search collection within it.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# Create a collection group with scale-to-zero enabled (min OCU = 0)
aws opensearchserverless create-collection-group \
    --name product-search-cg \
    --generation NEXTGEN \
    --standby-replicas ENABLED \
    --capacity-limits "minIndexingCapacityInOCU=0,maxIndexingCapacityInOCU=4,minSearchCapacityInOCU=0,maxSearchCapacityInOCU=4" \
    --endpoint-url "https://aoss.us-east-2.amazonaws.com" \
    --region "us-east-2"

# Create a vector search collection in the group
aws opensearchserverless create-collection \
    --name product-vectors \
    --type VECTORSEARCH \
    --collection-group-name product-search-cg \
    --endpoint-url "https://aoss.us-east-2.amazonaws.com" \
    --region "us-east-2"&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The collection status transitions to ACTIVE within seconds.&lt;/p&gt; 
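&lt;p&gt;If you script this step, you may want to wait for the ACTIVE status before continuing. The following sketch polls &lt;code&gt;batch-get-collection&lt;/code&gt; and reads the &lt;code&gt;status&lt;/code&gt; field of the first result. The function name &lt;code&gt;wait_for_active&lt;/code&gt;, the 2-second interval, and the default of 30 checks are assumptions chosen for illustration.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# Illustrative helper: poll a collection until it reports ACTIVE.
# Usage: wait_for_active COLLECTION_NAME [MAX_CHECKS]
# Returns 0 once ACTIVE, 1 if the status is FAILED or the checks run out.
wait_for_active() {
    local name="$1"
    local max_checks="${2:-30}"
    local status=""
    local checks=0
    while [ "$checks" -lt "$max_checks" ]; do
        status=$(aws opensearchserverless batch-get-collection \
            --names "$name" \
            --query 'collectionDetails[0].status' \
            --output text \
            --endpoint-url "https://aoss.us-east-2.amazonaws.com" \
            --region "us-east-2")
        echo "Collection ${name}: ${status}"
        if [ "$status" = "ACTIVE" ]; then return 0; fi
        if [ "$status" = "FAILED" ]; then return 1; fi
        checks=$((checks + 1))
        sleep 2
    done
    return 1
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Calling &lt;code&gt;wait_for_active product-vectors&lt;/code&gt; after the &lt;code&gt;create-collection&lt;/code&gt; command lets the rest of a script proceed only when the collection is ready.&lt;/p&gt; 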
&lt;h3&gt;Step 3: Create a vector index&lt;/h3&gt; 
&lt;p&gt;Retrieve the collection endpoint and create a k-NN index using 3-dimensional vectors:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;ENDPOINT=$(aws opensearchserverless batch-get-collection \
    --names product-vectors \
    --query 'collectionDetails[0].collectionEndpoint' \
    --output text \
    --endpoint-url "https://aoss.us-east-2.amazonaws.com" \
    --region "us-east-2")

awscurl --service aoss --region us-east-2 \
    -XPUT "${ENDPOINT}/items" \
    -H "Content-Type: application/json" \
    -d '{
      "settings": {"index.knn": true},
      "mappings": {
        "properties": {
          "description": {"type": "text"},
          "embedding": {"type": "knn_vector", "dimension": 3,
            "method": {"name": "hnsw", "space_type": "cosinesimil", "engine": "faiss"}}
        }
      }
    }'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;blockquote&gt;
 &lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; If the collection has scaled to zero, the first request might take a few seconds while capacity scales up. If the request times out, wait 10 to 15 seconds and retry.&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;h3&gt;Step 4: Index sample documents with embeddings&lt;/h3&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;awscurl --service aoss --region us-east-2 \
    -XPOST "${ENDPOINT}/items/_bulk" \
    -H "Content-Type: application/json" \
    -d '
{ "index": { "_id": "1" } }
{ "description": "Wireless noise-cancelling headphones", "embedding": [0.8, 0.2, 0.1] }
{ "index": { "_id": "2" } }
{ "description": "Portable Bluetooth speaker", "embedding": [0.7, 0.3, 0.2] }
{ "index": { "_id": "3" } }
{ "description": "Over-ear studio monitor headphones", "embedding": [0.9, 0.1, 0.05] }
'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3&gt;Step 5: Run a k-NN query&lt;/h3&gt; 
&lt;p&gt;Search for the two nearest neighbors to a query vector. Wait 30 seconds after indexing to allow the vector index to build before running this query:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;awscurl --service aoss --region us-east-2 \
    -XGET "${ENDPOINT}/items/_search" \
    -H "Content-Type: application/json" \
    -d '{
      "query": {
        "knn": {
          "embedding": {
            "vector": [0.85, 0.15, 0.08],
            "k": 2
          }
        }
      }
    }'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The response returns the two most similar items, in this case, the headphone documents whose embeddings are closest to your query vector.&lt;/p&gt; 
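&lt;p&gt;To see why the headphone documents rank first, you can compute cosine similarity by hand for the three sample embeddings. The Python sketch below is an exact brute-force calculation, not the approximate HNSW search the index performs, and the scores OpenSearch reports use their own scaling, so compare the ordering rather than the raw values.&lt;/p&gt;

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Sample embeddings from Step 4, keyed by document _id.
embeddings = {
    "1": [0.8, 0.2, 0.1],    # Wireless noise-cancelling headphones
    "2": [0.7, 0.3, 0.2],    # Portable Bluetooth speaker
    "3": [0.9, 0.1, 0.05],   # Over-ear studio monitor headphones
}
query = [0.85, 0.15, 0.08]

# Rank document IDs from most to least similar to the query vector.
ranked = sorted(
    embeddings,
    key=lambda doc_id: cosine_similarity(query, embeddings[doc_id]),
    reverse=True,
)
top_two = ranked[:2]
```

&lt;p&gt;With these values, &lt;code&gt;top_two&lt;/code&gt; contains documents 3 and 1, the two headphone entries, and the speaker ranks last.&lt;/p&gt;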
&lt;p&gt;You can also run this query in OpenSearch UI. Navigate to your collection in the Amazon OpenSearch Service console and choose the OpenSearch UI Application URL. Follow the steps outlined in &lt;a href="https://aws.amazon.com/blogs/big-data/amazon-opensearch-service-launches-the-next-generation-opensearch-ui/" target="_blank" rel="noopener"&gt;this post&lt;/a&gt; to create a workspace, then open Dev Tools and run the following query:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-plaintext"&gt;GET items/_search
{
  "query": {
    "knn": {
      "embedding": {
        "vector": [0.85, 0.15, 0.08],
        "k": 2
      }
    }
  }
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3&gt;Step 6: Observe scale to zero&lt;/h3&gt; 
&lt;p&gt;After a period of inactivity (no indexing or search traffic), the collection group scales down to 0 OCU. Verify with:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;aws opensearchserverless batch-get-collection-group \
    --names product-search-cg \
    --endpoint-url "https://aoss.us-east-2.amazonaws.com" \
    --region "us-east-2"&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;In the response, &lt;code&gt;currentCapacity.search.capacityInOcu&lt;/code&gt; and &lt;code&gt;currentCapacity.indexing.capacityInOcu&lt;/code&gt; will show 0 after the collection has scaled down.&lt;/p&gt; 
&lt;p&gt;You can also navigate to the &lt;strong&gt;Collection groups&lt;/strong&gt; page in the Amazon OpenSearch Service console. Choose your collection group, then scroll down to the &lt;strong&gt;Monitoring&lt;/strong&gt; section. Here you can see two charts: &lt;strong&gt;Indexing capacity (OCUs)&lt;/strong&gt; and &lt;strong&gt;Search capacity (OCUs)&lt;/strong&gt;. After 10 minutes of idle time (no indexing or search requests), both metrics drop to zero, confirming that the service has released all compute resources for your collection.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-5937-2.png" alt="CloudWatch monitoring charts in the Amazon OpenSearch Service console showing indexing and search capacity dropping to zero OCUs after 10 minutes of idle time" width="600"&gt;&lt;/p&gt; 
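&lt;p&gt;If you want to automate the check, you can parse the CLI output instead of reading it by eye. The following Python sketch assumes the response nests the fields as &lt;code&gt;collectionGroupDetails[].currentCapacity.search.capacityInOcu&lt;/code&gt; and &lt;code&gt;...indexing.capacityInOcu&lt;/code&gt;, and that each entry carries a &lt;code&gt;name&lt;/code&gt; field. The sample response is hypothetical and trimmed to those fields, so verify the shape against your own output.&lt;/p&gt;

```python
def is_scaled_to_zero(response, group_name):
    """Return True when both search and indexing capacity report 0 OCU."""
    for group in response.get("collectionGroupDetails", []):
        if group.get("name") != group_name:
            continue
        capacity = group.get("currentCapacity", {})
        search = capacity.get("search", {}).get("capacityInOcu")
        indexing = capacity.get("indexing", {}).get("capacityInOcu")
        return search == 0 and indexing == 0
    # The named collection group was not found in the response.
    return False


# Hypothetical response, trimmed to the fields this check reads.
sample_response = {
    "collectionGroupDetails": [
        {
            "name": "product-search-cg",
            "currentCapacity": {
                "search": {"capacityInOcu": 0},
                "indexing": {"capacityInOcu": 0},
            },
        }
    ]
}
idle = is_scaled_to_zero(sample_response, "product-search-cg")
```

&lt;p&gt;In practice you would load the JSON printed by &lt;code&gt;batch-get-collection-group&lt;/code&gt; with &lt;code&gt;json.loads&lt;/code&gt; and pass it to the function.&lt;/p&gt;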
&lt;h3&gt;Clean up&lt;/h3&gt; 
&lt;p&gt;To avoid ongoing charges, delete the resources you created in this walkthrough when you are done. Delete the collection first so the collection group becomes empty, then delete the group, then remove the security and access policies.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;# Look up the collection ID, then delete the collection
COLLECTION_ID=$(aws opensearchserverless batch-get-collection \
    --names product-vectors \
    --query 'collectionDetails[0].id' \
    --output text \
    --endpoint-url "https://aoss.us-east-2.amazonaws.com" \
    --region "us-east-2")

aws opensearchserverless delete-collection \
    --id "${COLLECTION_ID}" \
    --endpoint-url "https://aoss.us-east-2.amazonaws.com" \
    --region "us-east-2"

# Look up the collection group ID, then delete the collection group
GROUP_ID=$(aws opensearchserverless batch-get-collection-group \
    --names product-search-cg \
    --query 'collectionGroupDetails[0].id' \
    --output text \
    --endpoint-url "https://aoss.us-east-2.amazonaws.com" \
    --region "us-east-2")

aws opensearchserverless delete-collection-group \
    --id "${GROUP_ID}" \
    --endpoint-url "https://aoss.us-east-2.amazonaws.com" \
    --region "us-east-2"

# Delete the security and access policies
aws opensearchserverless delete-security-policy \
    --name product-vectors-encryption \
    --type encryption \
    --endpoint-url "https://aoss.us-east-2.amazonaws.com" \
    --region "us-east-2"

aws opensearchserverless delete-security-policy \
    --name product-vectors-network \
    --type network \
    --endpoint-url "https://aoss.us-east-2.amazonaws.com" \
    --region "us-east-2"

aws opensearchserverless delete-access-policy \
    --name product-vectors-data \
    --type data \
    --endpoint-url "https://aoss.us-east-2.amazonaws.com" \
    --region "us-east-2"&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h2&gt;Upgrading existing collections&lt;/h2&gt; 
&lt;p&gt;To move to the new architecture, create a new collection group and collection, then reindex your data into it. For a step-by-step walkthrough of the reindexing process, refer to &lt;a href="https://aws.amazon.com/blogs/big-data/perform-reindexing-in-amazon-opensearch-serverless-using-amazon-opensearch-ingestion/" target="_blank" rel="noopener"&gt;Perform reindexing in Amazon OpenSearch Serverless using Amazon OpenSearch Ingestion&lt;/a&gt;. Your queries and index mappings remain the same. Only the collection endpoint changes. With the new static Regional endpoint, that is a one-time update.&lt;/p&gt; 
&lt;p&gt;The new architecture supports SEARCH and VECTORSEARCH collection types. TIMESERIES is not supported at launch.&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;The new Amazon OpenSearch Serverless architecture is available today. You can create your first OpenSearch Serverless collection in seconds with Express Create and scale it to handle production traffic, and your compute costs drop to zero when the collection sits idle.&lt;/p&gt; 
&lt;p&gt;To learn more:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless.html" target="_blank" rel="noopener"&gt;Amazon OpenSearch Service documentation&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://us-east-1.console.aws.amazon.com/aos/home?region=us-east-1#opensearch" target="_blank" rel="noopener"&gt;Amazon OpenSearch Service console&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/opensearch-service/pricing/" target="_blank" rel="noopener"&gt;Amazon OpenSearch Service pricing page&lt;/a&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;If you have questions or feedback, open a support case or reach out through your AWS account team. We look forward to seeing what you build.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-5937-3.jpg" alt="Sohaib Katariwala" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Sohaib Katariwala&lt;/h3&gt; 
  &lt;p&gt;Sohaib is a Senior Specialist Solutions Architect at AWS focused on Amazon OpenSearch Service, based in Chicago, IL. His interests are in all things data and analytics. More specifically, he loves to help customers use AI in their data strategy to solve modern-day challenges.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-5937-4.png" alt="Raj Ramasubbu" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Raj Ramasubbu&lt;/h3&gt; 
  &lt;p&gt;Raj is a Senior Analytics and AI Specialist Solutions Architect at AWS, focused on big data, analytics, and AI/ML. He partners with customers to architect and build highly scalable, performant, and secure cloud-based solutions.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/27/BDB-5937-5.png" alt="Arjun Nambiar" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Arjun Nambiar&lt;/h3&gt; 
  &lt;p&gt;Arjun is a Product Manager with Amazon OpenSearch Service. He focuses on ingestion technologies that enable ingesting data from a wide variety of sources into Amazon OpenSearch Service at scale. Arjun is interested in large-scale distributed systems and cloud-centered technologies, and is based out of Seattle, Washington.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>How Buildkite Operates Test Analytics at Massive Scale with Amazon MSK and Amazon Managed Service for Apache Flink</title>
		<link>https://aws.amazon.com/blogs/big-data/how-buildkite-operates-test-analytics-at-massive-scale-with-amazon-msk-and-amazon-managed-service-for-apache-flink/</link>
		
		<dc:creator><![CDATA[James Hill]]></dc:creator>
		<pubDate>Wed, 27 May 2026 18:22:34 +0000</pubDate>
				<category><![CDATA[Amazon Managed Service for Apache Flink]]></category>
		<category><![CDATA[Amazon Managed Streaming for Apache Kafka (Amazon MSK)]]></category>
		<category><![CDATA[Customer Solutions]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<guid isPermaLink="false">e53a7ef7605133508cab64794cc43ad3b1a0ffc4</guid>

					<description>In this post, we explore how Buildkite uses Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Managed Service for Apache Flink to power Test Engine’s streaming-first analytics architecture at scale.</description>
										<content:encoded>&lt;p&gt;When engineering teams at Slack, Reddit, Canva, Airbnb, Shopify, and Uber need to ship code with confidence, they rely on &lt;a href="https://buildkite.com/platform/" target="_blank" rel="noopener"&gt;Buildkite&lt;/a&gt;. As a CI/CD platform, Buildkite orchestrates complex build, test, and deployment pipelines for some of the most demanding engineering organizations in the world. It handles everything from routine code commits to artificial intelligence (AI) model-training workloads, processing over &lt;strong&gt;50 billion requests per month&lt;/strong&gt;.&lt;/p&gt; 
&lt;p&gt;At the heart of Buildkite’s test orchestration portfolio is Test Engine, a specialized analytics product designed to help engineering teams understand and optimize their test suites at scale. Test Engine aggregates results across thousands of builds, flags flaky tests, runs parallel test execution across machine fleets, and delivers interactive analytics on test execution data. It supports arbitrary metadata tagging for dimensions like instance type, architecture, language version, cloud provider, and feature flags.&lt;/p&gt; 
&lt;p&gt;The challenge? Delivering all of this in real time, across multiple enterprise tenants, at a volume that would stress even the most robust data infrastructure. In this post, we explore how Buildkite uses Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Managed Service for Apache Flink to power Test Engine’s streaming-first analytics architecture at scale.&lt;/p&gt; 
&lt;h2&gt;The problem: When scale breaks traditional architectures&lt;/h2&gt; 
&lt;p&gt;Buildkite’s Test Engine must ingest and serve analytics on test telemetry from thousands of distributed pipelines simultaneously, for multiple enterprise customers. The scale is unforgiving: 50 billion test executions per month, 500K events per second at peak ingestion, and webhook payloads reaching 21 MB.&lt;/p&gt; 
&lt;h2&gt;The architectural evolution and its limits&lt;/h2&gt; 
&lt;p&gt;The original Rails and PostgreSQL stack couldn’t sustain this growth. In 2024, the team re-architected around a distributed streaming layer, a stateful stream processor for pre-aggregations, and multiple specialized stores: a key-value store for fast lookups, a relational database for pre-computed aggregates, and an open table format (Iceberg) with a distributed query engine (Trino) for flexible querying.&lt;/p&gt; 
&lt;p&gt;Yet the core tension remained unsolved. Enterprise customers demanded interactive, arbitrary slicing of billions of records across high-cardinality dimensions, not canned reports. The stream processor couldn’t handle ad hoc aggregations at query time. The key-value store was blind to analytical queries. The distributed query engine offered flexibility but was too slow for interactive use.&lt;/p&gt; 
&lt;p&gt;The result was a system that was expensive and operationally complex. It included nine relational database clusters, sprawling ETL pipelines, and 24/7 pre-aggregation jobs running regardless of demand. It still couldn’t deliver the one thing customers needed most: &lt;strong&gt;fast, flexible, interactive analytics at scale&lt;/strong&gt;.&lt;/p&gt; 
&lt;h2&gt;Architecture and implementation: MSK and Amazon Managed Service for Apache Flink as the streaming backbone&lt;/h2&gt; 
&lt;p&gt;The solution Buildkite arrived at centers on &lt;strong&gt;Amazon Managed Streaming for Apache Kafka (Amazon MSK)&lt;/strong&gt; and &lt;strong&gt;Amazon Managed Service for Apache Flink&lt;/strong&gt; as the real-time data streaming and processing layers, decoupling high-throughput ingestion from downstream analytics.&lt;/p&gt; 
&lt;h3&gt;The data pipeline&lt;/h3&gt; 
&lt;p&gt;The following diagram shows the end-to-end data flow from CI/CD agents through Amazon MSK and Amazon Managed Service for Apache Flink to the analytics layer.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/19/BDB-5828-1.png" alt="Architecture diagram showing the Buildkite data pipeline from CI/CD agents through Amazon MSK and Amazon Managed Service for Apache Flink to ClickHouse analytics" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;Amazon MSK sits at the critical junction between data producers (the distributed CI/CD agents and test collectors running across customer infrastructure) and the downstream processing and analytics layers. Amazon Managed Service for Apache Flink then transforms those raw event streams into enriched, queryable data before it reaches the analytics store.&lt;/p&gt; 
&lt;h3&gt;High-throughput ingestion from CI/CD pipelines&lt;/h3&gt; 
&lt;p&gt;Amazon MSK’s role begins at ingestion. Test collectors embedded in CI/CD pipelines publish test execution events directly to Kafka topics. The existing Amazon MSK cluster handles between &lt;strong&gt;5 MB/sec and 100 MB/sec&lt;/strong&gt; of inbound data under normal operating conditions. The architecture is designed to absorb the significant variance inherent in CI/CD workloads, where pipeline activity is bursty and correlated with engineering team working hours across global time zones.&lt;/p&gt; 
&lt;p&gt;When the Buildkite project was initiated, MSK Express Brokers were not yet available, leading the team to adopt MSK Tiered Storage as the primary mechanism for scaling and recovery. With MSK Express Brokers now generally available, the team is evaluating a migration of its most critical log ingestion workload, which sustains up to 1 GB/s at peak ingestion. MSK Express Brokers bring automatic storage scaling with zero storage management overhead, up to 20x faster scaling and 90% faster broker recovery, 3x higher per-broker throughput, 5x more partitions per broker, and built-in Intelligent Rebalancing.&lt;/p&gt; 
&lt;h3&gt;Real-time stream processing with Amazon Managed Service for Apache Flink&lt;/h3&gt; 
&lt;p&gt;Sitting between Amazon MSK and the analytics layer, Amazon Managed Service for Apache Flink acts as the stateful stream processing engine that transforms raw event streams before they reach downstream systems. Buildkite selected Flink for its exactly-once processing, mature stateful computation model, and deep Kafka integration. Handling sustained peaks of over 25,000 events per second, Amazon Managed Service for Apache Flink eliminates the operational overhead of cluster provisioning, version upgrades, checkpointing, and job recovery. This frees engineering teams to focus on application logic.&lt;/p&gt; 
&lt;p&gt;Amazon Managed Service for Apache Flink powers key stateful processing tasks, including flaky test detection through time-windowed pattern matching, enriching execution events with pipeline and customer metadata, and routing processed data to downstream systems such as ClickHouse for analytics, PostgreSQL for operational workloads, and Amazon Simple Storage Service (Amazon S3) for long-term archival.&lt;/p&gt; 
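&lt;p&gt;To make the idea of time-windowed flaky test detection concrete, here is a minimal Python sketch. It is not Buildkite’s detection logic and does not use the Flink API. It assumes a simple hypothetical rule: a test is flaky if it both passes and fails within the same fixed (tumbling) window. The event tuples, test names, and the 300-second window are illustrative.&lt;/p&gt;

```python
from collections import defaultdict


def find_flaky_tests(events, window_seconds=300):
    """Return test IDs that both passed and failed inside one tumbling window."""
    outcomes = defaultdict(set)
    for timestamp, test_id, passed in events:
        # Assign each event to a fixed-size window by integer division.
        window = int(timestamp // window_seconds)
        outcomes[(test_id, window)].add(passed)
    # A window containing both True and False marks the test as flaky.
    return sorted({test_id for (test_id, _), seen in outcomes.items() if len(seen) == 2})


# (timestamp in seconds, test ID, passed) for a handful of executions.
events = [
    (10, "test_login", True),
    (40, "test_login", False),
    (90, "test_checkout", False),
    (120, "test_checkout", False),
    (700, "test_search", True),
    (1300, "test_search", False),
]
flaky = find_flaky_tests(events)
```

&lt;p&gt;Here only &lt;code&gt;test_login&lt;/code&gt; is flagged: &lt;code&gt;test_checkout&lt;/code&gt; fails consistently, and the mixed results for &lt;code&gt;test_search&lt;/code&gt; fall in different windows. A stream processor would hold the per-test, per-window state durably and emit results as events arrive rather than over a finished list.&lt;/p&gt;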
&lt;h3&gt;Reliability and fault tolerance&lt;/h3&gt; 
&lt;p&gt;Amazon MSK’s three-replica configuration ensures that no single broker failure can cause data loss or ingestion interruption. Combined with flexible data retention, the architecture provides a meaningful replay window. If a downstream consumer (Amazon Managed Service for Apache Flink, ClickHouse, or another service) experiences an outage, it can resume processing from its last committed offset without data loss.&lt;/p&gt; 
&lt;p&gt;During the migration to the current architecture, Buildkite employed a &lt;strong&gt;dual-write strategy&lt;/strong&gt;: simultaneously writing to both the existing PostgreSQL pipeline and the new Amazon MSK/ClickHouse path. This approach allowed the team to validate data consistency and gradually shift traffic without risking customer-facing disruption. This pattern speaks to the operational maturity Amazon MSK provides.&lt;/p&gt; 
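&lt;p&gt;The dual-write pattern can be sketched in a few lines of Python. This is an illustration of the general technique under simplifying assumptions, not Buildkite’s implementation: plain dictionaries stand in for the PostgreSQL and Amazon MSK/ClickHouse paths, and &lt;code&gt;DualWriter&lt;/code&gt; and the record fields are hypothetical names.&lt;/p&gt;

```python
class DualWriter:
    """Write each record to the existing path and the new path, then compare."""

    def __init__(self, legacy_sink, new_sink):
        self.legacy_sink = legacy_sink
        self.new_sink = new_sink

    def write(self, key, record):
        # The legacy path stays authoritative while the new path is validated.
        self.legacy_sink[key] = record
        self.new_sink[key] = record

    def mismatched_keys(self):
        """Return keys whose records differ or are missing on either path."""
        keys = set(self.legacy_sink) | set(self.new_sink)
        return sorted(k for k in keys if self.legacy_sink.get(k) != self.new_sink.get(k))


legacy, new = {}, {}
writer = DualWriter(legacy, new)
writer.write("run-1", {"test": "test_login", "passed": True})
writer.write("run-2", {"test": "test_checkout", "passed": False})
consistent_before = writer.mismatched_keys()

# Simulate a record that never arrived on the new path.
del new["run-2"]
drift = writer.mismatched_keys()
```

&lt;p&gt;A comparison like &lt;code&gt;mismatched_keys&lt;/code&gt; is what lets a team confirm the two paths agree before shifting read traffic to the new one.&lt;/p&gt;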
&lt;h3&gt;Operational efficiency gains&lt;/h3&gt; 
&lt;p&gt;The shift to a streaming-first architecture, combined with the downstream simplification of the analytics engine, produced significant operational improvements:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Flink workloads reduced by 60%+&lt;/strong&gt;: Eliminating pre-aggregation jobs that ran continuously regardless of demand.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Key/value store completely retired&lt;/strong&gt;: Amazon MSK’s buffering capability, combined with ClickHouse’s query performance, eliminated the need for a separate fast-lookup store.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;PostgreSQL capacity cut in half&lt;/strong&gt;: Nine separate database clusters consolidated and right-sized.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Thousands of lines of application code deleted&lt;/strong&gt;: Simpler architecture means less ETL code, fewer failure modes, and faster onboarding for new engineers.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Platform performance at a glance&lt;/h3&gt; 
&lt;table border="1px" cellpadding="10px" width="100%"&gt; 
 &lt;tbody&gt;
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Metric&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Value&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Monthly test executions (for test engine platform)&lt;/td&gt; 
   &lt;td&gt;50 billion (4x growth from 3B)&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Sustained peak ingestion&lt;/td&gt; 
   &lt;td&gt;500K events/second&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Total records in analytics store&lt;/td&gt; 
   &lt;td&gt;200 billion&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Log ingestion requests&lt;/td&gt; 
   &lt;td&gt;70,000+ per second&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Peak webhook throughput&lt;/td&gt; 
   &lt;td&gt;1.7 GB/second&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;MSK inbound throughput range&lt;/td&gt; 
   &lt;td&gt;5 MB/sec – 100 MB/sec&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt;
&lt;/table&gt; 
&lt;h2&gt;Business and developer impact&lt;/h2&gt; 
&lt;p&gt;The technical architecture ultimately exists to serve one purpose: helping developers ship better software faster. The streaming-first architecture built on Amazon MSK and Amazon Managed Service for Apache Flink delivers on that promise across three dimensions.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;On-demand analytics&lt;/strong&gt; replaced pre-computed reports. Customers can now interactively slice and dice 70 billion records across arbitrary metadata dimensions. They get answers to queries like “Show me P50 test durations by instance type and architecture for the last 30 days” in seconds, not hours. &lt;strong&gt;Real-time log streaming&lt;/strong&gt; through the “live tail” feature means developers no longer wait for a build to complete before diagnosing failures. At 25,000 events per second, this experience scales across thousands of concurrent enterprise pipelines without degradation.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Smarter test intelligence&lt;/strong&gt; comes from Amazon Managed Service for Apache Flink’s stateful flaky test detection: when a test begins exhibiting intermittent failure patterns, Amazon Managed Service for Apache Flink identifies it as it happens, not after the fact. This is what separates a proactive analytics platform from a reactive one. It requires publishing data to Kafka, processing with Flink, and letting ClickHouse handle the complex read requests.&lt;/p&gt; 
&lt;h2&gt;Conclusion: Streaming as a strategic foundation&lt;/h2&gt; 
&lt;p&gt;Buildkite’s journey from a Rails/Postgres monolith to a streaming-first analytics platform reflects a pattern increasingly common among enterprise SaaS companies: a reliable, high-throughput streaming and processing layer is not an optimization. It is a prerequisite for operating at scale.&lt;/p&gt; 
&lt;p&gt;Amazon MSK and Amazon Managed Service for Apache Flink form the backbone that helps Buildkite ingest 50 billion test executions per month, serve real-time interactive analytics to enterprise customers, and do so at lower cost than the more complex architecture it replaced. Amazon MSK handles durable, elastic event buffering. Amazon Managed Service for Apache Flink transforms raw streams into enriched, queryable data. Together they absorb the operational complexity that would otherwise consume engineering capacity.&lt;/p&gt; 
&lt;p&gt;For platform engineers evaluating streaming infrastructure for multi-tenant SaaS workloads, the signal is clear: invest in the streaming backbone early, and let managed services handle the operational complexity.&lt;/p&gt; 
&lt;p&gt;To learn more about Amazon MSK and Amazon Managed Service for Apache Flink, visit &lt;a href="https://aws.amazon.com/msk" target="_blank" rel="noopener"&gt;aws.amazon.com/msk&lt;/a&gt; and &lt;a href="https://aws.amazon.com/managed-service-apache-flink" target="_blank" rel="noopener"&gt;aws.amazon.com/managed-service-apache-flink&lt;/a&gt;.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/19/BDB-5828-2.jpeg" alt="James Hill" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;James Hill&lt;/h3&gt; 
  &lt;p&gt;James has been building and scaling software systems for more than 25 years, from early web applications to platforms that now process millions of builds every day. Starting his career as a software engineer, James has led teams across Australia, the UK, and globally, solving problems in performance, reliability, and delivery speed at massive scale. Today, he works with some of the world’s largest engineering organizations to help them ship faster and with greater confidence, drawing on deep, hands-on experience in both engineering and product leadership. James is passionate about turning testing from a bottleneck into a feedback engine that accelerates learning across an organization.&lt;/p&gt;
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/19/BDB-5828-3.png" alt="Mitch James" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Mitch James&lt;/h3&gt; 
  &lt;p&gt;Mitch is a Brand and Marketing Strategist with deep expertise crafting end-to-end brand experiences and fostering engaged communities around technical tooling. He brings 15+ years of Brand, Design, and Marketing leadership across devtools, consumer product, and B2B enterprise. Previously, Mitch has built and led creative teams at Adobe, IBM, Salesforce, George P Johnson, Wunderman Thompson, and VML. Today, he leads global marketing and design for Buildkite, working with engineering teams who set the pace at the frontier of software delivery.&lt;/p&gt;
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/19/BDB-5828-4.jpeg" alt="Masudur Rahaman Sayem" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Masudur Rahaman Sayem&lt;/h3&gt; 
  &lt;p&gt;Masudur is a Streaming Data Architect at AWS with over 25 years of experience in the IT industry. He collaborates with AWS customers worldwide to architect and implement data streaming solutions that address complex business challenges. As an expert in distributed computing, Sayem specializes in designing large-scale distributed systems architecture for maximum performance and scalability. He has a keen interest and passion for distributed architecture, which he applies to designing production-ready solutions at internet scale.&lt;/p&gt;
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/19/BDB-5828-5.jpeg" alt="Miranda Li" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Miranda Li&lt;/h3&gt; 
  &lt;p&gt;Miranda is a Senior Solutions Architect at AWS, specializing in Independent Software Vendor (ISV) and cloud-native architectures. With four years dedicated to helping software partners innovate and scale on AWS, she focuses on helping ISVs build and optimize their solutions for the cloud. She brings deep technical expertise in cloud infrastructure and data analytics, with a strong focus on supporting technical customers in areas such as Infrastructure as a Service (IaaS), network architecture, and security. Outside of work, she is an avid badminton player and enjoys staying active through jogging and outdoor adventures.&lt;/p&gt;
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>How Zynga scaled multi-warehouse data governance with Amazon Redshift federated permissions</title>
		<link>https://aws.amazon.com/blogs/big-data/how-zynga-scaled-multi-warehouse-data-governance-with-amazon-redshift-federated-permissions/</link>
					
		
		<dc:creator><![CDATA[Johan Eklund, Matthew Wongkee, Noelia Tardón]]></dc:creator>
		<pubDate>Wed, 27 May 2026 18:19:35 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon Redshift]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Customer Solutions]]></category>
		<category><![CDATA[Experience-Based Acceleration]]></category>
		<category><![CDATA[Games]]></category>
		<category><![CDATA[Industries]]></category>
		<category><![CDATA[Security & Governance]]></category>
		<guid isPermaLink="false">a4ed03c2032d53cfcf0d35f38bbdf583c77e2214</guid>

					<description>In this post, we walk through how Zynga adopted Amazon Redshift federated permissions and AWS IAM Identity Center to enforce consistent, tiered data access across provisioned and serverless Amazon Redshift environments without building custom synchronization pipelines.</description>
										<content:encoded>&lt;p&gt;&lt;a href="https://www.zynga.com/" target="_blank" rel="noopener noreferrer"&gt;Zynga&lt;/a&gt;, a global leader in interactive entertainment, operates a portfolio of mobile game studios including &lt;a href="https://www.socialpoint.es/games" target="_blank" rel="noopener noreferrer"&gt;Socialpoint&lt;/a&gt;, the creators of &lt;em&gt;Dragon City&lt;/em&gt; and &lt;em&gt;Monster Legends&lt;/em&gt;. Zynga’s analytics platform processes telemetry and revenue data across studios using &lt;a href="https://aws.amazon.com/redshift/" target="_blank" rel="noopener noreferrer"&gt;Amazon Redshift&lt;/a&gt; as its central data warehouse.&lt;/p&gt; 
&lt;p&gt;As Zynga expanded its analytics architecture to include individual studios with their own compute environments, the team faced a challenge: how to maintain centralized data governance while granting studios independent query capacity. Their existing approach to permission management introduced lag and required custom infrastructure if scaled to multiple warehouses.&lt;/p&gt; 
&lt;p&gt;In this post, we walk through how Zynga adopted &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/federated-permissions.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Redshift federated permissions&lt;/a&gt; and &lt;a href="https://aws.amazon.com/iam/identity-center/" target="_blank" rel="noopener noreferrer"&gt;AWS IAM Identity Center&lt;/a&gt; to enforce consistent, tiered data access across provisioned and serverless Amazon Redshift environments without building custom synchronization pipelines.&lt;/p&gt; 
&lt;h2&gt;The challenge&lt;/h2&gt; 
&lt;p&gt;Zynga needed to onboard Socialpoint’s current Amazon Redshift workloads and make Zynga’s central data available to them. Zynga’s existing production cluster would house the Socialpoint raw data, but the compute would come from another warehouse set up as a consumer. At the same time, Zynga’s data access control policies would need to be enforced across all warehouses. Zynga uses a tiered access control policy, which would need to be synced across all consumers with no permission lag or manual grant synchronization.&lt;/p&gt; 
&lt;p&gt;During the migration, Socialpoint’s specific extract, transform, and load (ETL) processes would be included in Zynga’s central ETLs and their data ingestion pipeline would be replaced by Zynga’s latest generation of data ingestion infrastructure. Because the migration process happens in stages, Amazon Redshift sizing would also gradually need to increase.&lt;/p&gt; 
&lt;p&gt;The team evaluated two alternatives before arriving at a solution:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/lake-formation/" target="_blank" rel="noopener noreferrer"&gt;AWS Lake Formation&lt;/a&gt; couldn’t manage local and cross-cluster permissions using the same interface, and required &lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener noreferrer"&gt;AWS Identity and Access Management (IAM)&lt;/a&gt; or IAM Identity Center authentication for all users, including service accounts.&lt;/li&gt; 
 &lt;li&gt;Manual grants on consumer clusters introduced a delay between when permissions were updated on the producer and when they took effect on the consumer. This approach would also require an external job that synced permissions and would be unlikely to scale well beyond 2–3 consumers.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;Zynga implemented a solution using three AWS services working together:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Amazon Redshift federated permissions enable cross-cluster queries without explicit data shares. Permissions granted on the producer cluster propagate immediately to consumer workgroups through &lt;a href="https://aws.amazon.com/glue/" target="_blank" rel="noopener noreferrer"&gt;AWS Glue&lt;/a&gt; Data Catalog registration.&lt;/li&gt; 
 &lt;li&gt;AWS IAM Identity Center provides unified authentication through federation with &lt;a href="https://www.okta.com/" target="_blank" rel="noopener noreferrer"&gt;Okta&lt;/a&gt;. When users sign in, their Okta group memberships are provisioned through System for Cross-domain Identity Management (SCIM) and automatically map to Amazon Redshift roles, removing the need for external synchronization jobs.&lt;/li&gt; 
 &lt;li&gt;Amazon Redshift Serverless provides the compute layer for Socialpoint, scaling to zero when idle and avoiding the need to pre-size a provisioned cluster during the migration period.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;The architecture uses a dual-grant approach where every permission is granted to both an IAM Identity Center group (for users) and a federated IAM role (for service accounts). This gives both authentication paths the same access.&lt;/p&gt; 
&lt;h2&gt;How it works&lt;/h2&gt; 
&lt;h3&gt;Authentication with IAM Identity Center&lt;/h3&gt; 
&lt;p&gt;Zynga’s existing Okta directory syncs to IAM Identity Center, which is connected to the Amazon Redshift Serverless workgroup. When a user authenticates, Amazon Redshift automatically creates a user mapped to their email address and assigns them to roles based on their Okta group membership.&lt;/p&gt; 
&lt;p&gt;For example, an analyst in the &lt;em&gt;Gamma Tier&lt;/em&gt; group signs in and is automatically assigned the &lt;code&gt;AWSIDC:role.sso.gamma&lt;/code&gt; role in Amazon Redshift. No manual role assignment or synchronization job is required.&lt;/p&gt; 
&lt;p&gt;Service accounts, used for programmatic access, authenticate differently: either by using their IAM role to call the get-credentials API, or by using the new federated permissions feature. Each service account assumes a federated IAM role, which creates a corresponding federated user in Amazon Redshift (for example, &lt;code&gt;IAMR:role_iam_gamma&lt;/code&gt;).&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignleft wp-image-90998 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5896-image-1.jpeg" alt="The identity layer shows two authentication paths. Interactive users authenticate through Okta, which syncs group memberships to AWS IAM Identity Center. On login, IAM Identity Center automatically maps users to tiered Amazon Redshift roles (AWSIDC:role.sso.gamma, beta, alpha). Service accounts take a separate path, assuming a cluster IAM role that creates a corresponding federated user (IAMR:role_iam_gamma) in Amazon Redshift." width="1319" height="615"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Figure 1 – Identity layer&lt;/em&gt;&lt;/p&gt; 
&lt;h3&gt;The dual-grant approach&lt;/h3&gt; 
&lt;p&gt;To ensure that both users and service accounts can access the same data, every read permission is granted to both the IAM Identity Center group and the federated IAM role in a single statement:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;GRANT SELECT ON schema.table TO
	'IAMR:role_iam_gamma',
	ROLE 'AWSIDC:role.sso.gamma';&lt;/code&gt;&lt;/pre&gt; 
 &lt;h3&gt;Transitioning the producer cluster&lt;/h3&gt; 
 &lt;p&gt;The shared provisioned cluster already had active users with local grants. To avoid disruption, Zynga implemented a tri-grant approach on the producer during the transition period. Existing stored procedures were modified to grant permissions to three targets: the legacy local role, the IAM Identity Center group, and the federated IAM role.&lt;/p&gt; 
 &lt;div class="hide-language"&gt; 
  &lt;pre&gt;&lt;code class="lang-sql"&gt;GRANT SELECT ON schema.table TO
	ROLE role_rs_gamma,
	ROLE 'AWSIDC:role.sso.gamma','IAMR:role_iam_gamma';&lt;/code&gt;&lt;/pre&gt; 
 &lt;/div&gt; 
 &lt;p&gt;This approach maintains backward compatibility for existing users on the producer while enabling immediate access from the new serverless workgroup. The long-term plan is to migrate all producer users to IAM Identity Center and retire the legacy local grants.&lt;/p&gt; 
 &lt;h3&gt;Stored procedures for consistent governance&lt;/h3&gt; 
 &lt;p&gt;Rather than requiring users to construct dual-grant statements manually, Zynga created stored procedures that encapsulate the grant logic:&lt;/p&gt; 
 &lt;ul&gt; 
  &lt;li&gt;&lt;code&gt;grant_read&lt;/code&gt; accepts a table name and access tier, then issues the appropriate dual-grant for both the IAM Identity Center group and the federated IAM role.&lt;/li&gt; 
  &lt;li&gt;&lt;code&gt;grant_write&lt;/code&gt; grants data definition language (DDL) and data manipulation language (DML) permissions to the appropriate team-based role.&lt;/li&gt; 
 &lt;/ul&gt; 
 &lt;p&gt;This provides a consistent interface for permission management regardless of which cluster or workgroup the user is on.&lt;/p&gt; 
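 &lt;p&gt;As a rough illustration only, a procedure along the lines of &lt;code&gt;grant_read&lt;/code&gt; might look like the following sketch. Zynga’s actual implementation isn’t shown in this post. The parameter names, the tier check, and the way role names are derived from the tier (following the &lt;code&gt;IAMR:role_iam_gamma&lt;/code&gt; and &lt;code&gt;AWSIDC:role.sso.gamma&lt;/code&gt; examples) are assumptions, and the grantee quoting mirrors the dual-grant statement shown earlier.&lt;/p&gt; 
 &lt;div class="hide-language"&gt; 
  &lt;pre&gt;&lt;code class="lang-sql"&gt;CREATE OR REPLACE PROCEDURE grant_read(p_table VARCHAR, p_tier VARCHAR)
AS $$
BEGIN
	-- Assumed tier names, taken from the role examples in this post.
	IF p_tier NOT IN ('alpha', 'beta', 'gamma') THEN
		RAISE EXCEPTION 'Unknown access tier: %', p_tier;
	END IF;

	-- Dual grant: one statement covers the federated IAM role (service accounts)
	-- and the IAM Identity Center role (users).
	-- p_table is concatenated as-is, so this sketch assumes a trusted caller.
	EXECUTE 'GRANT SELECT ON ' || p_table
		|| ' TO ''IAMR:role_iam_' || p_tier || ''','
		|| ' ROLE ''AWSIDC:role.sso.' || p_tier || '''';
END;
$$ LANGUAGE plpgsql;

-- Example call (hypothetical table name):
CALL grant_read('schema.table', 'gamma');&lt;/code&gt;&lt;/pre&gt; 
 &lt;/div&gt; 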
 &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full wp-image-90997" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5896-image-2.jpeg" alt="Architecture diagram in the AWS Cloud showing a federated Amazon Redshift setup. On the left, a Zynga provisioned Amazon Redshift cluster (producer) holds Zynga central data — game telemetry and revenue data — and stored procedures grant_read(table, tier) and grant_write(table, team). On the right, a Socialpoint Amazon Redshift Serverless workgroup (consumer) holds Socialpoint ETL data and the same two stored procedures. Bidirectional arrows between the two environments label the top connection as &amp;quot;Federated queries (cross-cluster reads)&amp;quot; between the data stores, and the bottom connection as &amp;quot;Federated permissions&amp;quot; between the stored procedures." width="1319" height="593"&gt;&lt;/p&gt; 
 &lt;p&gt;&lt;em&gt;Figure 2 – Data and compute layer&lt;/em&gt;&lt;/p&gt; 
 &lt;h2&gt;Results&lt;/h2&gt; 
 &lt;p&gt;The migration delivered measurable improvements:&lt;/p&gt; 
 &lt;ul&gt; 
  &lt;li&gt;Immediate permission propagation – Grants on the producer cluster took effect on the consumer workgroup instantly, replacing a process that previously required manual intervention and introduced lag.&lt;/li&gt; 
  &lt;li&gt;Zero additional infrastructure cost – Federated permissions, federated queries, and IAM Identity Center added no incremental costs to the architecture.&lt;/li&gt; 
  &lt;li&gt;Removed custom synchronization – The team removed the need for Lake Formation configurations, external AWS Lambda functions, and Airflow workflows for permission management.&lt;/li&gt; 
  &lt;li&gt;Scalable pattern – The same architecture can be extended to additional studio workgroups without duplicating permission management logic.&lt;/li&gt; 
 &lt;/ul&gt; 
 &lt;h2&gt;Lessons learned&lt;/h2&gt; 
 &lt;p&gt;Zynga adopted federated permissions shortly after the feature launched in the US West (Oregon) Region (us-west-2) in January 2026. The team shared several observations from their early adoption:&lt;/p&gt; 
 &lt;ul&gt; 
  &lt;li&gt;Run a proof of concept first. The team validated the full permission model in a test environment before deploying to production, including testing that existing data shares to other clusters were not disrupted.&lt;/li&gt; 
  &lt;li&gt;Plan for the dual-grant requirement. Because IAM Identity Center users and federated IAM roles are distinct identity types, every read permission requires two grants. Encapsulating this in stored procedures prevents errors and reduces cognitive overhead.&lt;/li&gt; 
  &lt;li&gt;Start with serverless for new workloads. With Amazon Redshift Serverless, the team can avoid sizing decisions during the migration period. If usage patterns later justify it, they can migrate to a provisioned cluster from a serverless snapshot with minimal downtime.&lt;/li&gt; 
  &lt;li&gt;Engage with AWS. As an early adopter of a new feature, Zynga maintained regular contact with the Amazon Redshift team through their AWS Technical Account Manager to report issues and request enhancements.&lt;/li&gt; 
 &lt;/ul&gt; 
 &lt;h2&gt;Conclusion&lt;/h2&gt; 
 &lt;p&gt;Zynga’s adoption of Amazon Redshift federated permissions demonstrates how organizations with multi-cluster Amazon Redshift architectures can enforce centralized data governance without building custom synchronization infrastructure. By combining federated permissions with IAM Identity Center and Amazon Redshift Serverless, the team established a pattern that scales to additional studios while maintaining consistent access controls and reducing operational overhead.&lt;/p&gt; 
 &lt;p&gt;To learn more about the services used in this post, see the following resources:&lt;/p&gt; 
 &lt;ul&gt; 
  &lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/simplify-multi-warehouse-data-governance-with-amazon-redshift-federated-permissions/" target="_blank" rel="noopener noreferrer"&gt;Simplify multi-warehouse data governance with Amazon Redshift federated permissions&lt;/a&gt;&lt;/li&gt; 
  &lt;li&gt;&lt;a href="https://aws.amazon.com/blogs/big-data/scale-fine-grained-permissions-across-warehouses-with-amazon-redshift-and-aws-iam-identity-center/" target="_blank" rel="noopener noreferrer"&gt;Scale fine-grained permissions across warehouses with Amazon Redshift and AWS IAM Identity Center&lt;/a&gt;&lt;/li&gt; 
  &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/federated-overview.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Redshift federated queries&lt;/a&gt;.&lt;/li&gt; 
 &lt;/ul&gt; 
 &lt;hr style="width: 80%"&gt; 
 &lt;h2&gt;About the authors&lt;/h2&gt; 
 &lt;footer&gt; 
  &lt;div class="blog-author-box"&gt; 
   &lt;div class="blog-author-image"&gt;
    &lt;img loading="lazy" class="alignnone wp-image-91387 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/19/johan_eklund_2.jpg" alt="" width="100" height="103"&gt;
   &lt;/div&gt; 
   &lt;h3 class="lb-h4"&gt;Johan Eklund&lt;/h3&gt; 
   &lt;p&gt;&lt;a href="https://uk.linkedin.com/in/johan-eklund-587bb478" target="_blank" rel="noopener"&gt;Johan Eklund&lt;/a&gt; is a Principal Software Engineer on the Data Engineering team at Zynga, focused on building and operating large-scale data platforms on AWS. Day-to-day, his work spans Redshift architecture, ETL orchestration, platform operations, and security and access control. Outside of work, Johan enjoys exploring art and music through creative coding and DIY hardware synthesizers.&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;div class="blog-author-box"&gt; 
   &lt;div class="blog-author-image"&gt;
    &lt;img loading="lazy" class="alignnone size-full wp-image-91348" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/14/mwongkee_headshot.jpg" alt="" width="100" height="100"&gt;
   &lt;/div&gt; 
   &lt;h3 class="lb-h4"&gt;Matthew Wongkee&lt;/h3&gt; 
   &lt;p&gt;&lt;a href="https://ca.linkedin.com/in/matthew-wongkee-6093a328" target="_blank" rel="noopener"&gt;Matthew Wongkee&lt;/a&gt; is a Principal Software Engineer at Zynga, where he builds scalable and high-performance data platforms and infrastructure, including Amazon Redshift. He focuses on enabling teams to leverage their data through self-service capabilities, including data discovery, query optimization, curated datasets, and data quality. In his spare time, he practices short track speed skating.&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;div class="blog-author-box"&gt; 
   &lt;div class="blog-author-image"&gt;
    &lt;img loading="lazy" class="alignnone size-full wp-image-91347" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/14/Noelia-Tardón-Delgado.png" alt="" width="100" height="133"&gt;
   &lt;/div&gt; 
   &lt;h3 class="lb-h4"&gt;Noelia Tardón&lt;/h3&gt; 
   &lt;p&gt;&lt;a href="https://es.linkedin.com/in/tardondelgado" target="_blank" rel="noopener"&gt;Noelia Tardón&lt;/a&gt; is a Principal Software Engineer at Zynga. Based in Barcelona, she works on large-scale cloud data platforms with a focus on performance optimization and cost efficiency. She designs and builds data lake and data warehouse solutions while ensuring compliance with data protection regulations. Outside of work, she enjoys spending time with her family and running.&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;div class="blog-author-box"&gt; 
   &lt;div class="blog-author-image"&gt;
    &lt;img loading="lazy" class="size-full wp-image-35755 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/10/13/Sandeep.jpg" alt="" width="100" height="133"&gt;
   &lt;/div&gt; 
   &lt;h3 class="lb-h4"&gt;Sandeep Adwankar&lt;/h3&gt; 
   &lt;p&gt;&lt;a href="https://www.linkedin.com/in/adwankar/" target="_blank" rel="noopener"&gt;Sandeep&lt;/a&gt; is a Senior Technical Product Manager at AWS. Based in the California Bay Area, he works with customers around the globe to translate business and technical requirements into products that enable customers to improve how they manage, secure, and access data.&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;div class="blog-author-box"&gt; 
   &lt;div class="blog-author-image"&gt;
    &lt;img loading="lazy" class="alignleft wp-image-89089" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/03/18/headshot2-1.jpg" alt="" width="100" height="133"&gt;
   &lt;/div&gt; 
   &lt;h3 class="lb-h4"&gt;Satesh Sonti&lt;/h3&gt; 
   &lt;p&gt;&lt;a href="https://www.linkedin.com/in/satish-kumar-sonti/"&gt;Satesh&lt;/a&gt; is a Principal Analytics Specialist Solutions Architect based out of Atlanta, specializing in building enterprise data platforms, data warehousing, and analytics solutions. He has over 19 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;div class="blog-author-box"&gt; 
   &lt;div class="blog-author-image"&gt;
    &lt;img loading="lazy" class="alignleft size-full wp-image-90405" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5843-image-12.jpeg" alt="Photo of AUthor - Steve Phillips" width="120" height="160"&gt;
   &lt;/div&gt; 
   &lt;h3 class="lb-h4"&gt;Steve Phillips&lt;/h3&gt; 
   &lt;p&gt;&lt;a href="https://www.linkedin.com/in/stevephillipsca" target="_blank" rel="noopener noreferrer"&gt;Steve&lt;/a&gt; is a Principal Technical Account Manager and Analytics specialist at AWS in the North America region. Steve currently focuses on data warehouse architectural design, data lakes, data ingestion pipelines, and cloud distributed architectures.&lt;/p&gt; 
  &lt;/div&gt; 
 &lt;/footer&gt; 
&lt;/div&gt;</content:encoded>
					
					
			
		
		
			</item>
	</channel>
</rss>