<?xml version="1.0" encoding="UTF-8" standalone="no"?><rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" version="2.0">

<channel>
	<title>AWS Big Data Blog</title>
	<atom:link href="https://aws.amazon.com/blogs/big-data/feed/" rel="self" type="application/rss+xml"/>
	<link>https://aws.amazon.com/blogs/big-data/</link>
	<description>Official Big Data Blog of Amazon Web Services</description>
	<lastBuildDate>Thu, 14 May 2026 16:58:32 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	
	<item>
		<title>Optimize Amazon S3 Tables queries with Amazon Redshift</title>
		<link>https://aws.amazon.com/blogs/big-data/optimize-amazon-s3-tables-queries-with-amazon-redshift/</link>
		
		<dc:creator><![CDATA[Tom Romano]]></dc:creator>
		<pubDate>Thu, 14 May 2026 16:58:31 +0000</pubDate>
				<category><![CDATA[Amazon Redshift]]></category>
		<category><![CDATA[Amazon S3 Tables]]></category>
		<category><![CDATA[Analytics]]></category>
		<guid isPermaLink="false">90a84f0b5b502e8e31dbc0dcfe595a3574d309b0</guid>

					<description>This is the third post in our S3 Tables and Amazon Redshift series. The first post covered getting started with querying Apache Iceberg tables, and the second post walked through enterprise-scale governance and access controls. In this post, you address common performance and usability gaps with three approaches.</description>
										<content:encoded>&lt;p&gt;&lt;a href="https://aws.amazon.com/s3/features/tables/" target="_blank" rel="noopener"&gt;Amazon S3 Tables&lt;/a&gt; with &lt;a href="https://aws.amazon.com/redshift/" target="_blank" rel="noopener"&gt;Amazon Redshift&lt;/a&gt; gives you a powerful combination for analytical workloads on Apache Iceberg tables. But as query volumes grow, small inefficiencies compound. For example, repeated queries, such as dashboards refreshing hourly or analysts running the same joins throughout the day, scan data directly from &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener"&gt;Amazon Simple Storage Service (Amazon S3)&lt;/a&gt; every time. The fully qualified three-part table references (&lt;code&gt;database@catalog.schema.table&lt;/code&gt;) add friction for business intelligence (BI) tools and end users who expect simpler SQL syntax. And without tuning the way S3 Tables organizes your data files, queries read more files than necessary. When you address these three areas, your S3 Tables queries in Amazon Redshift become faster, simpler, and more cost-efficient, whether you’re powering a recurring dashboard or supporting ad hoc analysis at scale.&lt;/p&gt; 
&lt;p&gt;This is the third post in our S3 Tables and Amazon Redshift series. The &lt;a href="https://aws.amazon.com/blogs/big-data/using-amazon-s3-tables-with-amazon-redshift-to-query-apache-iceberg-tables/" target="_blank" rel="noopener"&gt;first post&lt;/a&gt; covered getting started with querying &lt;a href="https://iceberg.apache.org/" target="_blank" rel="noopener"&gt;Apache Iceberg&lt;/a&gt; tables, and the &lt;a href="https://aws.amazon.com/blogs/big-data/scalable-analytics-and-centralized-governance-for-apache-iceberg-tables-using-amazon-s3-tables-and-amazon-redshift/" target="_blank" rel="noopener"&gt;second post&lt;/a&gt; walked through enterprise-scale governance and access controls. In this post, you address those performance and usability gaps with three approaches:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;Create external schemas to simplify queries from three-part notation down to two-part notation.&lt;/li&gt; 
 &lt;li&gt;Build materialized views that store pre-computed results locally so repeated queries skip the S3 scan.&lt;/li&gt; 
 &lt;li&gt;Configure S3 Tables compaction strategies so the data file layout matches your query patterns.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;The following diagram shows how these three approaches work together. External schemas [1] simplify query syntax through &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/resource-links-about.html" target="_blank" rel="noopener"&gt;AWS Lake Formation resource links&lt;/a&gt; [2], materialized views [3] store pre-computed results locally in Amazon Redshift, and S3 Tables compaction [4] optimizes the underlying file layout for your query patterns.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/07/BDB-5744-1.png" alt="Optimizing S3 Tables queries with external schemas, materialized views, and compaction strategies" width="600"&gt;&lt;/p&gt; 
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;Before you begin, make sure you have:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;An AWS account with permissions to manage &lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener"&gt;AWS Identity and Access Management&lt;/a&gt; (IAM) roles, &lt;a href="https://aws.amazon.com/lake-formation/" target="_blank" rel="noopener"&gt;AWS Lake Formation&lt;/a&gt;, S3 Tables, and Redshift.&lt;/li&gt; 
 &lt;li&gt;An &lt;a href="https://aws.amazon.com/redshift/redshift-serverless/" target="_blank" rel="noopener"&gt;Amazon Redshift Serverless&lt;/a&gt; workgroup or Amazon Redshift provisioned cluster (patch 188 or higher).&lt;/li&gt; 
 &lt;li&gt;An S3 table bucket with a &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-namespace-create.html" target="_blank" rel="noopener"&gt;namespace&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-create.html" target="_blank" rel="noopener"&gt;tables&lt;/a&gt; created.&lt;/li&gt; 
 &lt;li&gt;Lake Formation configured with the &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/using-service-linked-roles.html" target="_blank" rel="noopener"&gt;AWSServiceRoleForRedshift service-linked role&lt;/a&gt; as a read-only administrator.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;If you haven’t completed these steps, follow the setup instructions in the &lt;a href="https://aws.amazon.com/blogs/big-data/using-amazon-s3-tables-with-amazon-redshift-to-query-apache-iceberg-tables/" target="_blank" rel="noopener"&gt;first post in this series&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;Simplify queries with external schemas&lt;/h2&gt; 
&lt;p&gt;The previous posts in this series used the auto-mounted catalog to query S3 Tables with three-part notation:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;SELECT * FROM redshifticeberg@s3tablescatalog.icebergsons3.examples;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;This syntax works, but it is cumbersome in business intelligence (BI) tools, in manually typed queries, and in application code. It also requires users to connect through IAM federation. By creating an external schema, you can reference the same tables with a concise two-part notation:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;SELECT * FROM s3tables_schema.examples;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;To set this up, you create a Lake Formation resource link that points to your S3 Tables database, then create an external schema in Amazon Redshift that points to that resource link. Your setup differs slightly depending on whether your users authenticate through IAM federation or database credentials. An external schema doesn’t change query performance, but it removes a common barrier to adoption by simplifying table references.&lt;/p&gt; 
&lt;h3&gt;Create a Lake Formation resource link&lt;/h3&gt; 
&lt;p&gt;Both authentication methods require a resource link in Lake Formation that points to your S3 Tables database.&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;In the Lake Formation console, choose &lt;strong&gt;Databases&lt;/strong&gt; under &lt;strong&gt;Data Catalog&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;On the &lt;strong&gt;Create&lt;/strong&gt; menu, choose &lt;strong&gt;Resource link&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Configure the resource link with the following settings: 
  &lt;ul&gt; 
   &lt;li&gt;&lt;strong&gt;Resource link name:&lt;/strong&gt; &lt;code&gt;s3tables_rl&lt;/code&gt;&lt;/li&gt; 
   &lt;li&gt;&lt;strong&gt;Destination Catalog:&lt;/strong&gt; Your account ID (for example, &lt;code&gt;111122223333&lt;/code&gt;)&lt;/li&gt; 
   &lt;li&gt;&lt;strong&gt;Shared Database:&lt;/strong&gt; Your S3 Tables database (for example, &lt;code&gt;icebergsons3&lt;/code&gt;)&lt;/li&gt; 
   &lt;li&gt;&lt;strong&gt;Shared Database’s Catalog ID:&lt;/strong&gt; Your S3 table bucket in the format &lt;code&gt;111122223333:s3tablescatalog/redshifticeberg&lt;/code&gt;&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/07/BDB-5744-2.png" alt="Resource link creation in Lake Formation with catalog ID and shared database configured" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;For more information, see &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/creating-resource-links.html" target="_blank" rel="noopener"&gt;Creating resource links&lt;/a&gt; in the Lake Formation documentation.&lt;/p&gt; 
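&lt;p&gt;If you prefer to script this step, the console settings above correspond to the AWS Glue &lt;code&gt;CreateDatabase&lt;/code&gt; API, which creates a resource link when you pass a &lt;code&gt;TargetDatabase&lt;/code&gt;. The following is a minimal sketch, not a definitive procedure. It reuses the example values from this post (account ID &lt;code&gt;111122223333&lt;/code&gt;, table bucket &lt;code&gt;redshifticeberg&lt;/code&gt;, database &lt;code&gt;icebergsons3&lt;/code&gt;) and assumes that your AWS CLI credentials are allowed to create databases in the Data Catalog:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Sketch: create the resource link s3tables_rl that points to the S3 Tables database.
# All names and IDs are the example values used in this post; replace them with your own.
aws glue create-database \
    --catalog-id 111122223333 \
    --database-input '{
        "Name": "s3tables_rl",
        "TargetDatabase": {
            "CatalogId": "111122223333:s3tablescatalog/redshifticeberg",
            "DatabaseName": "icebergsons3"
        },
        "CreateTableDefaultPermissions": []
    }'

# Read the resource link back to confirm that it shows the expected target database.
aws glue get-database \
    --catalog-id 111122223333 \
    --name s3tables_rl&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Scripting the resource link this way can make the setup repeatable across accounts and AWS Regions, at the cost of managing the JSON input yourself.&lt;/p&gt; 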
&lt;h3&gt;Option A: External schema for IAM federated users&lt;/h3&gt; 
&lt;p&gt;If your users connect to Amazon Redshift through IAM federation, &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_SCHEMA.html" target="_blank" rel="noopener"&gt;create the external schema&lt;/a&gt; with the &lt;code&gt;SESSION&lt;/code&gt; keyword. This passes the federated user’s credentials through to Lake Formation for access control:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;CREATE EXTERNAL SCHEMA s3tables_schema
FROM DATA CATALOG
DATABASE 's3tables_rl'
CATALOG_ID '111122223333'
IAM_ROLE 'SESSION'
CATALOG_ROLE 'SESSION';&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Lake Formation evaluates permissions based on each federated user’s IAM role, so users see only the tables and columns that their role allows. This is the recommended approach for new deployments because it provides fine-grained access control without additional role management.&lt;/p&gt; 
&lt;h3&gt;Option B: External schema for database users&lt;/h3&gt; 
&lt;p&gt;External applications like Tableau, Power BI, and custom ETL tools often authenticate with database credentials instead of IAM federation. These users need an IAM role that accesses S3 Tables on their behalf.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Create an IAM service role to access S3 Tables:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;You create a role (for example, &lt;code&gt;S3TableAccessRole&lt;/code&gt;) with a trust policy that allows Amazon Redshift to assume it:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "redshift.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;You then attach the following permission policies to the role:&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;A policy for Lake Formation data access (substitute your 12-digit AWS Account ID for&lt;/em&gt; &lt;code&gt;YOUR_ACCOUNT_ID&lt;/code&gt;&lt;em&gt;):&lt;/em&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "lakeformation:GetDataAccess",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "YOUR_ACCOUNT_ID"
                }
            }
        },
        {
            "Effect": "Deny",
            "Action": "lakeformation:PutDataLakeSettings",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "YOUR_ACCOUNT_ID"
                }
            }
        }
    ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;em&gt;A policy for &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html" target="_blank" rel="noopener"&gt;AWS Glue Data Catalog&lt;/a&gt; access (substitute the appropriate AWS Region for&lt;/em&gt; &lt;code&gt;REGION_ID&lt;/code&gt; &lt;em&gt;and your 12-digit AWS Account ID for&lt;/em&gt; &lt;code&gt;YOUR_ACCOUNT_ID&lt;/code&gt;&lt;em&gt;). This example uses wildcards for databases and tables. For production, scope these permissions to your specific resources and AWS Region:&lt;/em&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetTableVersion",
                "glue:GetTableVersions",
                "glue:GetTags"
            ],
            "Resource": [
                "arn:aws:glue:REGION_ID:YOUR_ACCOUNT_ID:catalog",
                "arn:aws:glue:REGION_ID:YOUR_ACCOUNT_ID:database/*",
                "arn:aws:glue:REGION_ID:YOUR_ACCOUNT_ID:table/*/*"
            ]
        }
    ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Grant Lake Formation permissions to the role:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;In the Lake Formation console, grant &lt;code&gt;S3TableAccessRole&lt;/code&gt; the DESCRIBE permission on the resource link database and the SELECT permission on its tables. For detailed steps, see &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/granting-lake-formation-permissions.html" target="_blank" rel="noopener"&gt;Granting Lake Formation permissions&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/07/BDB-5744-3.png" alt="Lake Formation DESCRIBE permission on resource link database" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/07/BDB-5744-4.png" alt="Lake Formation SELECT permission on tables" width="600"&gt;&lt;/p&gt; 
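&lt;p&gt;The same grants can be sketched with the AWS CLI. This example assumes the resource link (&lt;code&gt;s3tables_rl&lt;/code&gt;), table bucket (&lt;code&gt;redshifticeberg&lt;/code&gt;), and database (&lt;code&gt;icebergsons3&lt;/code&gt;) names used earlier in this post. It also assumes that SELECT is granted on the target tables in the S3 Tables catalog rather than on the resource link itself, and it uses a table wildcard that you might want to replace with individual table names:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Sketch: grant DESCRIBE on the resource link database to the service role.
aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/S3TableAccessRole \
    --resource '{"Database":{"CatalogId":"111122223333","Name":"s3tables_rl"}}' \
    --permissions DESCRIBE

# Sketch: grant SELECT on all tables in the target S3 Tables database (assumed scope).
aws lakeformation grant-permissions \
    --principal DataLakePrincipalIdentifier=arn:aws:iam::111122223333:role/S3TableAccessRole \
    --resource '{"Table":{"CatalogId":"111122223333:s3tablescatalog/redshifticeberg","DatabaseName":"icebergsons3","TableWildcard":{}}}' \
    --permissions SELECT&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 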
&lt;p&gt;&lt;strong&gt;Associate the role and create the schema:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;First, associate the IAM role with your Amazon Redshift cluster or workgroup. For instructions, see &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/authorizing-redshift-service.html" target="_blank" rel="noopener"&gt;Associating IAM roles with Amazon Redshift&lt;/a&gt;.&lt;/p&gt; 
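&lt;p&gt;As a rough sketch, the association can also be done from the AWS CLI. The cluster identifier &lt;code&gt;my-redshift-cluster&lt;/code&gt; and namespace name &lt;code&gt;my-namespace&lt;/code&gt; are placeholders that don’t appear elsewhere in this post. Run only the command that matches your deployment type:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Provisioned cluster: add the role to the cluster (placeholder cluster identifier).
aws redshift modify-cluster-iam-roles \
    --cluster-identifier my-redshift-cluster \
    --add-iam-roles arn:aws:iam::111122223333:role/S3TableAccessRole

# Redshift Serverless: roles are associated with the namespace (placeholder name).
# --iam-roles takes a list, so include roles that are already associated
# if you want to keep them.
aws redshift-serverless update-namespace \
    --namespace-name my-namespace \
    --iam-roles arn:aws:iam::111122223333:role/S3TableAccessRole&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 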
&lt;p&gt;Create the external schema:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;CREATE EXTERNAL SCHEMA s3tables_schema
FROM DATA CATALOG
DATABASE 's3tables_rl'
IAM_ROLE 'arn:aws:iam::111122223333:role/S3TableAccessRole';&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Then grant access to your database users:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;GRANT USAGE ON SCHEMA s3tables_schema TO my_database_user;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3&gt;Query with two-part notation&lt;/h3&gt; 
&lt;p&gt;With either option, you can now query S3 Tables using the simpler two-part notation:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;SELECT * FROM s3tables_schema.examples LIMIT 10;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/07/BDB-5744-5.png" alt="Query results showing two-part notation returning rows from the examples table" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;You can use this notation in BI tools, JDBC/ODBC connections, and application code, and your users no longer need to know the underlying catalog structure.&lt;/p&gt; 
&lt;h2&gt;Accelerate queries with materialized views&lt;/h2&gt; 
&lt;p&gt;When you repeatedly query S3 Tables, each execution scans the external data from S3. Materialized views store pre-computed results in Amazon Redshift, so subsequent queries read from local storage instead of scanning S3 on every run.&lt;/p&gt; 
&lt;p&gt;Amazon Redshift supports incremental refresh for materialized views on Apache Iceberg tables, including after INSERT, DELETE, UPDATE, and table compaction operations on the base table. After the initial creation, each subsequent refresh processes only the rows that changed since the last refresh, rather than recomputing the full result set. This helps reduce both the time and compute cost of keeping your views current, especially for large tables with frequent changes.&lt;/p&gt; 
&lt;p&gt;Materialized views have general limitations and considerations when used with external data lake tables. For details, see &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-external-table.html" target="_blank" rel="noopener"&gt;Materialized views on external data lake tables&lt;/a&gt;.&lt;/p&gt; 
&lt;h3&gt;Create a materialized view on S3 Tables&lt;/h3&gt; 
&lt;p&gt;The following example &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-create-sql-command.html" target="_blank" rel="noopener"&gt;creates a materialized view&lt;/a&gt; that joins the &lt;code&gt;examples&lt;/code&gt; table in S3 Tables with a local &lt;code&gt;categories&lt;/code&gt; table in Amazon Redshift. The view pre-computes daily record counts and distinct ID counts per category:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;CREATE MATERIALIZED VIEW mv_daily_category_summary
DISTSTYLE KEY
DISTKEY (category_id)
SORTKEY (insert_date)
AS
SELECT
    c.category_id,
    c.department,
    e.insert_date,
    COUNT(*) AS record_count,
    COUNT(DISTINCT e.id) AS unique_ids
FROM s3tables_schema.examples e
JOIN public.categories c
  ON c.category_id = e.category_id
GROUP BY c.category_id, c.department, e.insert_date;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Query the materialized view directly:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;SELECT category_id, department, insert_date, record_count
FROM mv_daily_category_summary
ORDER BY record_count DESC
LIMIT 10;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Your query can now read from local Amazon Redshift storage and typically returns results without scanning S3 Tables:&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/07/BDB-5744-6.png" alt="Query results from the materialized view showing category data with record counts" width="600"&gt;&lt;/p&gt; 
&lt;h3&gt;Refresh strategies&lt;/h3&gt; 
&lt;p&gt;You have two options for keeping materialized views current:&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Automatic refresh:&lt;/em&gt; Set &lt;code&gt;AUTO REFRESH YES&lt;/code&gt; in the view definition to have Amazon Redshift automatically refresh the view in the background when it detects changes to the base tables. This is a good fit for dashboards and reports that can tolerate a short delay between data changes and query results. Automatic refresh requires an external schema created with Option B (an IAM role for database users). The default is &lt;code&gt;AUTO REFRESH NO&lt;/code&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Manual refresh:&lt;/em&gt; Run &lt;code&gt;REFRESH MATERIALIZED VIEW&lt;/code&gt; when you need to control the timing:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;REFRESH MATERIALIZED VIEW mv_daily_category_summary;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Use manual refresh when you need to coordinate updates with data loading pipelines or when you want to refresh during off-peak hours.&lt;/p&gt; 
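&lt;p&gt;If you created the view without automatic refresh and later decide to turn it on, one possible approach is to run &lt;code&gt;ALTER MATERIALIZED VIEW&lt;/code&gt; through the Amazon Redshift Data API. The following sketch assumes a Redshift Serverless workgroup named &lt;code&gt;my-workgroup&lt;/code&gt; and a database named &lt;code&gt;dev&lt;/code&gt;, both placeholders, and an external schema created with Option B:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Sketch: turn on automatic refresh for the existing materialized view.
# my-workgroup and dev are placeholder names; replace them with your own.
STATEMENT_ID=$(aws redshift-data execute-statement \
    --workgroup-name my-workgroup \
    --database dev \
    --sql "ALTER MATERIALIZED VIEW mv_daily_category_summary AUTO REFRESH YES;" \
    --query Id \
    --output text)

# The Data API runs statements asynchronously, so check the outcome separately.
aws redshift-data describe-statement \
    --id "$STATEMENT_ID" \
    --query Status&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The same pattern works for a scheduled &lt;code&gt;REFRESH MATERIALIZED VIEW&lt;/code&gt; statement if you prefer to trigger manual refreshes from a data loading pipeline.&lt;/p&gt; 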
&lt;h2&gt;Tune S3 Tables compaction for your query patterns&lt;/h2&gt; 
&lt;p&gt;S3 Tables automatically compacts small Parquet files into larger ones in the background. This compaction reduces the number of read requests your query engine must make, which can improve query performance. By default, compaction targets a file size of 512 MB, configurable between 64 MB and 512 MB. Four compaction strategies are available, and choosing the right one for your query patterns can make a measurable difference.&lt;/p&gt; 
&lt;h3&gt;Compaction strategies&lt;/h3&gt; 
&lt;table border="1px" width="100%" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Strategy&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;When to use&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;How it works&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Auto&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;You want S3 to decide for you&lt;/td&gt; 
   &lt;td&gt;Selects sort compaction for sorted tables, binpack for unsorted tables&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Binpack&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;General-purpose workloads, unsorted tables&lt;/td&gt; 
   &lt;td&gt;Combines small files into larger files (100 MB+) and applies pending row-level deletes&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Sort&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;Queries frequently filter on a single column (e.g., &lt;code&gt;insert_date&lt;/code&gt;)&lt;/td&gt; 
   &lt;td&gt;Organizes data by the table’s sort-order columns during compaction&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Z-order&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;Queries filter on two or more columns together (e.g., &lt;code&gt;insert_date&lt;/code&gt; and &lt;code&gt;category_id&lt;/code&gt;)&lt;/td&gt; 
   &lt;td&gt;Blends multiple column values into a single scalar for sorting&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;Binpack improves performance by reducing the number of files a query engine reads. Sort compaction goes further. By ordering data within files, it enables query engines to skip entire files based on column min/max metadata during predicate pushdown. This is effective for queries that filter on the sort column, such as date-range filters. Z-order extends this benefit to queries that filter on multiple columns simultaneously, at the cost of slightly less efficient pruning on any single column compared to a pure sort.&lt;/p&gt; 
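&lt;p&gt;To use sort or z-order compaction, you first need to define a sort order on the table with one column (sort) or multiple columns (z-order). The following statements use the syntax of the Apache Iceberg SQL extensions for Apache Spark:&lt;/p&gt; 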
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;-- Sort
ALTER TABLE icebergsons3.examples WRITE ORDERED BY insert_date;

-- Z-Order
ALTER TABLE icebergsons3.examples WRITE ORDERED BY insert_date, category_id;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3&gt;Configure a compaction strategy&lt;/h3&gt; 
&lt;p&gt;To change the compaction strategy for a table, use the &lt;code&gt;PutTableMaintenanceConfiguration&lt;/code&gt; API through the &lt;a href="https://aws.amazon.com/cli/" target="_blank" rel="noopener"&gt;AWS Command Line Interface (AWS CLI)&lt;/a&gt;:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;aws s3tables put-table-maintenance-configuration \
    --table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/redshifticeberg \
    --type icebergCompaction \
    --namespace icebergsons3 \
    --name examples \
    --value '{"status":"enabled","settings":{"icebergCompaction":{"strategy":"sort"}}}'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;To use z-order compaction instead, specify &lt;code&gt;{"strategy":"z-order"}&lt;/code&gt; in the same command.&lt;/p&gt; 
&lt;p&gt;To adjust the target file size (for example, to 256 MB):&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;aws s3tables put-table-maintenance-configuration \
    --table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/redshifticeberg \
    --type icebergCompaction \
    --namespace icebergsons3 \
    --name examples \
    --value '{"status":"enabled","settings":{"icebergCompaction":{"targetFileSizeMB":256}}}'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;For more detail on sort and z-order, see &lt;a href="https://aws.amazon.com/blogs/aws/new-improve-apache-iceberg-query-performance-in-amazon-s3-with-sort-and-z-order-compaction/" target="_blank" rel="noopener"&gt;Improve Apache Iceberg query performance in Amazon S3 with sort and z-order compaction&lt;/a&gt;.&lt;/p&gt; 
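&lt;p&gt;After changing a maintenance setting, it can help to read the configuration back and check when maintenance last ran. The following sketch uses the same example table bucket, namespace, and table as the commands above. The exact fields returned might differ from what you expect, so treat the output as a starting point for verification:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Sketch: read back the maintenance configuration for the table.
aws s3tables get-table-maintenance-configuration \
    --table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/redshifticeberg \
    --namespace icebergsons3 \
    --name examples

# Sketch: check the status of the most recent maintenance jobs for the table.
aws s3tables get-table-maintenance-job-status \
    --table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/redshifticeberg \
    --namespace icebergsons3 \
    --name examples&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 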
&lt;h3&gt;Snapshot management&lt;/h3&gt; 
&lt;p&gt;S3 Tables manages snapshots automatically. By default, it keeps a minimum of one snapshot and expires snapshots older than 120 hours (5 days). You can customize snapshot retention by setting &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-maintenance.html#s3-tables-maintenance-snapshot" target="_blank" rel="noopener"&gt;minSnapshotsToKeep and maxSnapshotAgeHours&lt;/a&gt;. After a snapshot reaches the expiration time you configured in your retention settings, S3 Tables marks objects that only that snapshot references as noncurrent and removes them based on the unreferenced file removal policy.&lt;/p&gt; 
&lt;p&gt;You can adjust these settings if your workload needs more snapshots for time-travel queries or longer retention:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;aws s3tables put-table-maintenance-configuration \
    --table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/redshifticeberg \
    --namespace icebergsons3 \
    --name examples \
    --type icebergSnapshotManagement \
    --value '{"status":"enabled","settings":{"icebergSnapshotManagement":{"minSnapshotsToKeep":10,"maxSnapshotAgeHours":2500}}}'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Keep in mind that retaining more snapshots increases storage costs. If a materialized view references an expired snapshot, Amazon Redshift falls back to a full recompute on the next refresh. Therefore, snapshot retention can directly affect your materialized view refresh behavior. Balance snapshot retention with your materialized view refresh frequency to avoid unnecessary full recomputes.&lt;/p&gt; 
&lt;p&gt;For more information, see &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-maintenance.html" target="_blank" rel="noopener"&gt;Maintenance for tables&lt;/a&gt; in the Amazon S3 documentation.&lt;/p&gt; 
&lt;h2&gt;Best practices&lt;/h2&gt; 
&lt;p&gt;&lt;strong&gt;Choose the right access pattern for your users.&lt;/strong&gt; Use IAM federation with &lt;code&gt;SESSION&lt;/code&gt; credentials for new applications and interactive users. Reserve the IAM role approach for BI tools and extract, transform, and load (ETL) pipelines that can’t integrate with IAM federation directly. Plan to migrate database users to federated access over time.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Match compaction strategy to query patterns.&lt;/strong&gt; Use sort compaction when your queries filter on a single column (such as date ranges). Use z-order when queries filter on two or more columns together. Stick with the auto default if your query patterns vary or you’re unsure.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Size materialized views for your refresh window.&lt;/strong&gt; Materialized views that join large external tables with local tables take longer to refresh. If your data changes frequently, keep the materialized view focused on the specific aggregations your dashboards need rather than materializing entire tables.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Coordinate snapshot retention with materialized view refresh.&lt;/strong&gt; If a materialized view references an expired Iceberg snapshot, Amazon Redshift performs a full recompute instead of an incremental refresh. Set your snapshot retention (&lt;code&gt;maxSnapshotAgeHours&lt;/code&gt;) longer than your materialized view refresh interval.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Monitor compaction with &lt;a href="https://aws.amazon.com/cloudtrail/" target="_blank" rel="noopener"&gt;AWS CloudTrail&lt;/a&gt;.&lt;/strong&gt; S3 Tables logs compaction operations as CloudTrail management events. Track these to verify that compaction runs on schedule and to identify tables that might benefit from a different strategy.&lt;/p&gt; 
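&lt;p&gt;One way to start is to list recent S3 Tables management events with the CloudTrail &lt;code&gt;LookupEvents&lt;/code&gt; API. In this sketch, the event source value &lt;code&gt;s3tables.amazonaws.com&lt;/code&gt; is an assumption, so confirm it against an event in your own account before you build alerts on it:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Sketch: list recent S3 Tables management events recorded by CloudTrail.
# The event source value is an assumption; verify it against your own event history.
aws cloudtrail lookup-events \
    --lookup-attributes AttributeKey=EventSource,AttributeValue=s3tables.amazonaws.com \
    --max-results 20 \
    --query 'Events[].{time:EventTime,name:EventName}' \
    --output table&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 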
&lt;p&gt;&lt;strong&gt;Balance performance gains against storage costs.&lt;/strong&gt; Materialized views store pre-computed results in Amazon Redshift, adding to your managed storage. Compaction reduces file counts, but z-order and sort compaction can increase overall storage because of data duplication across sort boundaries. Review your Amazon Redshift managed storage usage and S3 Tables storage metrics periodically to make sure the performance benefits justify the additional storage utilization.&lt;/p&gt; 
&lt;h2&gt;Troubleshooting&lt;/h2&gt; 
&lt;table border="1px" width="100%" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Issue&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Resolution&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;“Permission denied” when creating the external schema&lt;/td&gt; 
   &lt;td&gt;Verify the IAM role has &lt;code&gt;lakeformation:GetDataAccess&lt;/code&gt; permission. Confirm you associated the role with your Amazon Redshift cluster or workgroup. Also check that you granted the role access to the resource link database and its tables in Lake Formation.&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;“Schema not found” or “Database not found” errors&lt;/td&gt; 
   &lt;td&gt;Confirm the resource link name in Lake Formation matches the &lt;code&gt;DATABASE&lt;/code&gt; value in your &lt;code&gt;CREATE EXTERNAL SCHEMA&lt;/code&gt; statement. Verify the catalog ID format uses the pattern &lt;code&gt;account_id:s3tablescatalog/bucket_name&lt;/code&gt;.&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;“Table not found” when querying through the external schema&lt;/td&gt; 
   &lt;td&gt;Check that Lake Formation permissions include table-level access, not just database-level. Verify the table exists in the S3 Tables catalog by querying it through the auto-mounted catalog first.&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Materialized view refresh falls back to full recompute&lt;/td&gt; 
   &lt;td&gt;Check if the referenced Iceberg snapshot has expired. Increase &lt;code&gt;maxSnapshotAgeHours&lt;/code&gt; in the snapshot management configuration. Verify that the base table hasn’t exceeded 4 million position deletes in a single data file. Compaction resolves this.&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Queries on S3 Tables are slow after data loading&lt;/td&gt; 
   &lt;td&gt;Compaction runs on an automated schedule and may not have processed recent writes yet. Check CloudTrail for the latest compaction event. Verify the compaction strategy matches your query patterns. Switch from binpack to sort if you filter on specific columns.&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;h2&gt;Cleaning up&lt;/h2&gt; 
&lt;p&gt;To avoid ongoing costs, remove the resources you created in this walkthrough:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;-- Drop materialized views
DROP MATERIALIZED VIEW IF EXISTS mv_daily_category_summary;

-- Drop external schemas
DROP SCHEMA IF EXISTS s3tables_schema;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Also remove:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;The IAM role (&lt;code&gt;S3TableAccessRole&lt;/code&gt;) and its attached policies, if you created one for database users.&lt;/li&gt; 
 &lt;li&gt;The Lake Formation resource link and associated permissions.&lt;/li&gt; 
 &lt;li&gt;The &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-buckets-delete.html" target="_blank" rel="noopener"&gt;S3 table bucket&lt;/a&gt;, if you no longer need the data.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, we showed how to optimize S3 Tables queries from Amazon Redshift using three approaches. External schemas simplify query syntax from three-part to two-part notation, making it easier for BI tools and end users to work with S3 Tables. Materialized views store pre-computed analytical results that reduce repeated S3 scans. S3 Tables compaction strategies tuned to your query patterns make file access more efficient.&lt;/p&gt; 
&lt;p&gt;For new applications, design your access layer with IAM federation and external schemas from the start. Use materialized views to accelerate repeated analytical queries that join S3 Tables with local Amazon Redshift data. Match your compaction strategy to how your team queries the data. Use sort compaction for date-range filters and z-order when queries filter on multiple columns at once. Furthermore, the same S3 tables you optimize here are also accessible from Amazon Athena, Amazon EMR, and third-party engines.&lt;/p&gt; 
&lt;p&gt;To learn more, see the &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables.html" target="_blank" rel="noopener"&gt;Amazon S3 Tables documentation&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-overview.html" target="_blank" rel="noopener"&gt;Materialized views in Amazon Redshift&lt;/a&gt;, and &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-maintenance.html" target="_blank" rel="noopener"&gt;S3 Tables maintenance&lt;/a&gt;. We welcome your feedback in the comments.&lt;/p&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/07/BDB-5744-7.png" alt="Tom Romano" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Tom Romano&lt;/h3&gt; 
  &lt;p&gt;Tom Romano is a Senior Solutions Architect for AWS World Wide Public Sector based in Tampa, FL. He works with GovTech customers to build solutions using serverless architectures, generative AI, and modern data and DevOps practices. In his free time, Tom flies remote control model airplanes and enjoys vacationing with his family around Florida and the Caribbean.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/07/BDB-5744-8.png" alt="Satesh Sonti" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Satesh Sonti&lt;/h3&gt; 
  &lt;p&gt;Satesh Sonti is a Principal Analytics Specialist Solutions Architect based out of Atlanta, specializing in building enterprise data platforms, data warehousing, and analytics solutions. He has over 20 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>Securing client confidentiality at scale: Automated data discovery and governed analytics for legal workloads</title>
		<link>https://aws.amazon.com/blogs/big-data/securing-client-confidentiality-at-scale-automated-data-discovery-and-governed-analytics-for-legal-workloads/</link>
					
		
		<dc:creator><![CDATA[Rohan Kamat]]></dc:creator>
		<pubDate>Wed, 13 May 2026 15:57:14 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon Macie]]></category>
		<category><![CDATA[Amazon Quick Suite]]></category>
		<category><![CDATA[Amazon Simple Notification Service (SNS)]]></category>
		<category><![CDATA[Amazon Simple Storage Service (S3)]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[AWS Glue]]></category>
		<category><![CDATA[AWS Lake Formation]]></category>
		<category><![CDATA[AWS Security Hub]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">99e1696fc52f579762912b853853132c5d6dde6d</guid>

					<description>In this post, we show you a reference architecture that automates sensitive data discovery across legal document repositories on Amazon Web Services (AWS), demonstrate how to capture structured findings as a compliance dataset, and guide you through building a governed analytics workspace that maintains your security boundaries. You walk away with a practical model for building security and analytics into the same lifecycle, without moving documents outside their system of record.</description>
										<content:encoded>&lt;p&gt;Automating data security and analytics for legal documents presents a unique challenge. Your legal team stores documents with strong access controls, organized by client and matter, encrypted at rest, and governed by well-defined policies. But what happens when you want to run analytics across those repositories? The typical path is extracting content into separate data pipelines or third-party tools, which fragments your governance model and introduces new risks. Law firms and corporate legal departments operate under distinct obligations that make data governance non-negotiable. Attorney-client privilege, work product doctrine, and professional conduct rules impose strict duties around how client information is handled, accessed, and disclosed. Governance failure in this context isn’t just a compliance gap; it can result in privilege waiver, disqualification from representation, or disciplinary action.&lt;/p&gt; 
&lt;p&gt;Legal professionals use &lt;em&gt;ethical walls&lt;/em&gt;, also called &lt;em&gt;information barriers&lt;/em&gt;, as structural safeguards that prevent the flow of confidential information between teams within a firm that represent adverse or potentially conflicting interests. Professional conduct rules mandate these barriers, and failure to maintain them can result in firm disqualification, malpractice liability, or regulatory sanctions.&lt;/p&gt; 
&lt;p&gt;Privilege boundaries are equally critical. Attorney-client privilege and work product protection apply only when you properly control access to the underlying material. If you expose privileged documents or metadata about their contents to unauthorized individuals, you risk losing your privilege protection. When organizations fail to maintain reasonable controls over privileged material, courts might find that they have waived their privilege. You should therefore actively manage your access governance, not only as a security concern but as a legal preservation requirement. When you extract content into separate analytics systems or grant broader access than your matter structures support, you create pressure on both protections. You gain visibility but lose confidence in your controls.&lt;/p&gt; 
&lt;p&gt;In this post, we show you a reference architecture that automates sensitive data discovery across legal document repositories on Amazon Web Services (AWS), demonstrate how to capture structured findings as a compliance dataset, and guide you through building a governed analytics workspace that maintains your security boundaries. You walk away with a practical model for building security and analytics into the same lifecycle, without moving documents outside their system of record.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Analytics shouldn’t weaken governance&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Most legal organizations have invested heavily in securing their document repositories. You store documents in structured storage, organized by client and matter. Your access controls map to matter boundaries (the organizational and access structures that separate one client engagement from another). You establish retention and hold policies. The difficulty starts when teams want to analyze what’s inside those repositories. Running analytics typically means copying content into a separate system, standing up a new data pipeline, or granting broader access than existing matter structures support. Each of these steps introduces governance gaps. Manual reporting fills some of the void, but it doesn’t scale and can’t provide continuous visibility. What’s missing is a model where security controls and analytics reinforce each other, where the act of discovering sensitive data also produces the dataset that you use for reporting, and where governance applies once and carries through every downstream operation.&lt;/p&gt; 
&lt;p&gt;Automation addresses this by combining continuous sensitive data discovery with governed analytics, built on discovery metadata rather than document copies. This automated approach delivers four key advantages:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;No document movement.&lt;/strong&gt;&amp;nbsp;Your files stay in their system of record. Analytics runs against structured discovery metadata, not document content, so governance boundaries remain intact.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Continuous discovery instead of manual scanning.&lt;/strong&gt;&amp;nbsp;Automated classification identifies regulated and sensitive information on an ongoing basis, replacing periodic manual reviews with on-demand visibility.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Unified governance.&lt;/strong&gt;&amp;nbsp;You define matter-aligned access policies once, and they carry through from document storage to findings analytics and compliance reporting.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Built-in audit readiness.&lt;/strong&gt;&amp;nbsp;A durable record of discovery findings and remediation actions accumulates automatically over time, giving you structured evidence for client reviews and regulatory inquiries.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;&lt;strong&gt;Reference Architecture&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;The following architecture shows how continuous discovery, governance, and compliance operations can work together without copying legal documents into analytics systems.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90724" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/30/BDB-5730-image-1.png" alt="This reference architecture illustrates how law firms and corporate legal departments can automate sensitive data discovery and compliance analytics on AWS without moving documents outside their system of record" width="919" height="642"&gt;&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Architecture walkthrough&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;&lt;strong&gt;Store and protect documents in Amazon Simple Storage Service (Amazon S3)&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Store your legal documents in&amp;nbsp;&lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3&lt;/a&gt;, which serves as the system of record for document content. Align your buckets and prefixes to client and matter structures so that access controls map directly to matter boundaries. Where your retention or legal hold requirements demand it, apply&amp;nbsp;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html" target="_blank" rel="noopener noreferrer"&gt;S3 Object Lock&lt;/a&gt;&amp;nbsp;to enforce immutability. You can encrypt your data using&amp;nbsp;&lt;a href="https://aws.amazon.com/kms/" target="_blank" rel="noopener noreferrer"&gt;AWS Key Management Service (AWS KMS)&lt;/a&gt;, which gives you centralized control over encryption keys and policies.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Discover and classify sensitive data with Amazon Macie&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;You will configure&amp;nbsp;&lt;a href="https://aws.amazon.com/macie/" target="_blank" rel="noopener noreferrer"&gt;Amazon Macie&lt;/a&gt;&amp;nbsp;to continuously analyze your document repositories. Macie identifies regulated information such as personally identifiable information (PII), financial data, and other sensitive content and produces structured findings that describe what Macie identified and where it exists. This provides ongoing visibility into data exposure without requiring document movement or manual scanning.&lt;/p&gt; 
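&lt;p&gt;To make the idea of structured findings concrete, the following Python sketch flattens one Macie sensitive data finding into metadata rows suitable for a compliance dataset. The field names follow the Macie finding format, but the sample finding, the bucket name, and the &lt;code&gt;client-id/matter-id/&lt;/code&gt; key prefix convention are assumptions made for illustration. Only metadata is extracted; no document content is read or copied.&lt;/p&gt; 

```python
from typing import Any, Dict, List


def flatten_finding(finding: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one metadata row per detection type in a Macie finding."""
    affected = finding.get("resourcesAffected", {})
    bucket = affected.get("s3Bucket", {}).get("name")
    key = affected.get("s3Object", {}).get("key", "")

    # Assumed prefix convention: client-id/matter-id/rest-of-key.
    parts = key.split("/")
    client_id = parts[0] if len(parts) >= 3 else None
    matter_id = parts[1] if len(parts) >= 3 else None

    result = finding.get("classificationDetails", {}).get("result", {})
    rows = []
    for group in result.get("sensitiveData", []):
        for detection in group.get("detections", []):
            rows.append(
                {
                    "finding_id": finding.get("id"),
                    "severity": finding.get("severity", {}).get("description"),
                    "bucket": bucket,
                    "object_key": key,
                    "client_id": client_id,
                    "matter_id": matter_id,
                    "category": group.get("category"),
                    "detection_type": detection.get("type"),
                    "count": detection.get("count", 0),
                }
            )
    return rows


# Hand-written sample finding (illustrative, not real Macie output).
sample = {
    "id": "example-finding-id",
    "severity": {"description": "High"},
    "resourcesAffected": {
        "s3Bucket": {"name": "example-legal-docs"},
        "s3Object": {"key": "client-a/matter-1001/engagement-letter.pdf"},
    },
    "classificationDetails": {
        "result": {
            "sensitiveData": [
                {
                    "category": "PERSONAL_INFORMATION",
                    "detections": [
                        {"type": "USA_SOCIAL_SECURITY_NUMBER", "count": 2}
                    ],
                }
            ]
        }
    },
}
rows = flatten_finding(sample)
```

&lt;p&gt;Deriving the client and matter from the object key is one possible design. If your repositories use bucket tags or a separate matter index instead, replace that step and keep the rest of the row layout.&lt;/p&gt; 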
&lt;p&gt;&lt;strong&gt;Catalog and govern findings with AWS Glue and AWS Lake Formation&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;You will use&amp;nbsp;&lt;a href="https://aws.amazon.com/glue/" target="_blank" rel="noopener noreferrer"&gt;AWS Glue&lt;/a&gt;&amp;nbsp;to catalog the findings dataset and maintain its schema so it stays query-ready. Apply&amp;nbsp;&lt;a href="https://aws.amazon.com/lake-formation/" target="_blank" rel="noopener noreferrer"&gt;AWS Lake Formation&lt;/a&gt;&amp;nbsp;tag-based policies to govern access, aligning tags to client, matter, and confidentiality tier. This approach enforces ethical walls and least-privilege access consistently across analytics and reporting activities.&lt;/p&gt; 
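&lt;p&gt;The following Python sketch shows what a matter-aligned, tag-based grant could look like with the boto3 Lake Formation client. The tag keys (&lt;code&gt;client&lt;/code&gt;, &lt;code&gt;matter&lt;/code&gt;, &lt;code&gt;confidentiality&lt;/code&gt;), their values, and the role ARN are hypothetical. The sketch also assumes that the LF-Tags already exist and are assigned to the findings tables.&lt;/p&gt; 

```python
from typing import Any, Dict, List


def matter_grant_request(
    principal_arn: str,
    client_id: str,
    matter_id: str,
    confidentiality_tiers: List[str],
) -> Dict[str, Any]:
    """Build keyword arguments for lakeformation.grant_permissions.

    Entries in the expression are combined with AND, and the values
    inside one entry with OR, so the principal only sees tables tagged
    with this client, this matter, and one of the allowed tiers.
    """
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "LFTagPolicy": {
                "ResourceType": "TABLE",
                "Expression": [
                    {"TagKey": "client", "TagValues": [client_id]},
                    {"TagKey": "matter", "TagValues": [matter_id]},
                    {
                        "TagKey": "confidentiality",
                        "TagValues": list(confidentiality_tiers),
                    },
                ],
            }
        },
        "Permissions": ["SELECT"],
    }


def grant_matter_access(request: Dict[str, Any]) -> None:
    """Apply the grant (needs boto3 and Lake Formation admin rights)."""
    import boto3  # imported lazily so the builder runs without AWS access

    boto3.client("lakeformation").grant_permissions(**request)


# Hypothetical role and tag values.
grant = matter_grant_request(
    "arn:aws:iam::111122223333:role/ExampleMatter1001Analyst",
    client_id="client-a",
    matter_id="matter-1001",
    confidentiality_tiers=["standard", "restricted"],
)
```

&lt;p&gt;Because the grant targets a tag expression rather than named tables, new findings tables that carry the same tags fall under the same ethical wall without another grant.&lt;/p&gt; 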
&lt;p&gt;&lt;strong&gt;AI-powered chat agent using Amazon Quick Suite&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;You can create custom chat agents to tailor conversational interfaces for specific legal business needs. These agents can be configured with legal-specific knowledge bases, connected to relevant document repositories, and customized with instructions appropriate for legal workflows. You can use this chat agent to interact with your legal documents through natural language conversation for capabilities like:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;b&gt;E-Discovery:&lt;/b&gt; Search and analyze large volumes of legal documents to quickly find relevant information across your document repository.&lt;/li&gt; 
 &lt;li&gt;&lt;b&gt;Contract Analysis:&lt;/b&gt; Review contracts and automatically extract key terms, clauses, and obligations to streamline your contract review process.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The chat agent can help you navigate complex document sets through conversational queries, making legal research and document review more efficient and accessible.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Analyze and report with Amazon Quick Sight&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;You will use &lt;a href="https://aws.amazon.com/quick/" target="_blank" rel="noopener noreferrer"&gt;Amazon Quick&lt;/a&gt; as your compliance operations workspace. Quick provides a unified environment where your teams can query findings, generate dashboards, track remediation actions, and produce audit-ready reports. The agentic AI capabilities of Amazon Quick can autonomously build analyses, surface anomalies across matters, generate executive summaries for client reviews, and proactively recommend remediation priorities based on finding severity and trends. Combined with built-in data stories for automated narrative generation and pixel-perfect paginated reports for regulatory submissions, Quick reduces the time from discovery to action while keeping your teams within a governed interface aligned to matter-based permissions. Rather than switching between separate visualization, workflow, and reporting tools, your legal and compliance teams can review findings, manage response activities, and collaborate all within a single workspace that respects ethical walls and privilege boundaries.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Escalate high-severity findings&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;For high-severity findings that demand immediate attention, route alerts through&amp;nbsp;&lt;a href="https://aws.amazon.com/security-hub/" target="_blank" rel="noopener noreferrer"&gt;AWS Security Hub&lt;/a&gt;&amp;nbsp;or&amp;nbsp;&lt;a href="https://aws.amazon.com/sns/" target="_blank" rel="noopener noreferrer"&gt;Amazon Simple Notification Service (Amazon SNS)&lt;/a&gt;&amp;nbsp;to trigger escalation workflows. This connects visibility directly to action when your teams identify sensitive data risks.&lt;/p&gt; 
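&lt;p&gt;One possible way to wire this escalation is an Amazon EventBridge rule that matches high-severity Macie findings and targets an SNS topic. The following Python sketch assumes that routing choice; the rule name and topic ARN are placeholders, and the topic’s access policy must separately allow EventBridge to publish to it.&lt;/p&gt; 

```python
import json
from typing import Any, Dict

# Event pattern for Macie findings whose severity description is High.
HIGH_SEVERITY_PATTERN = {
    "source": ["aws.macie"],
    "detail-type": ["Macie Finding"],
    "detail": {"severity": {"description": ["High"]}},
}


def is_high_severity(event: Dict[str, Any]) -> bool:
    """Local check that mirrors the pattern, handy for unit tests."""
    severity = event.get("detail", {}).get("severity", {})
    return (
        event.get("source") == "aws.macie"
        and event.get("detail-type") == "Macie Finding"
        and severity.get("description") == "High"
    )


def create_escalation_rule(rule_name: str, sns_topic_arn: str) -> None:
    """Create the rule and attach the topic (needs boto3 and credentials)."""
    import boto3  # imported lazily so the checks above run without AWS access

    events = boto3.client("events")
    events.put_rule(
        Name=rule_name,
        EventPattern=json.dumps(HIGH_SEVERITY_PATTERN),
        State="ENABLED",
    )
    events.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "escalation-topic", "Arn": sns_topic_arn}],
    )


# Hand-written sample events (illustrative only).
high_event = {
    "source": "aws.macie",
    "detail-type": "Macie Finding",
    "detail": {"severity": {"description": "High"}},
}
low_event = {
    "source": "aws.macie",
    "detail-type": "Macie Finding",
    "detail": {"severity": {"description": "Low"}},
}
```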
&lt;p&gt;&lt;strong&gt;Why this approach works for legal&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Documents stay where they belong.&lt;/strong&gt;&amp;nbsp;Your files remain in Amazon S3, aligned to client and matter boundaries. No content moves into separate analytics pipelines.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Ethical walls remain intact.&lt;/strong&gt;&amp;nbsp;Because analytics is built on discovery findings and not document copies, you can govern access to findings using the same matter-aligned controls that apply to documents. Compliance and security teams gain visibility without expanding document access.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Discovery runs continuously, not periodically.&lt;/strong&gt;&amp;nbsp;Rather than scheduling quarterly or annual scans, you maintain a current view of sensitive data across your repositories.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Governance applies once and carries through.&lt;/strong&gt;&amp;nbsp;Lake Formation tag-based policies govern findings access at the catalog level. You define your matter and confidentiality mappings once, and they carry through to every dashboard, query, and report.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Audit readiness is built in.&lt;/strong&gt;&amp;nbsp;Instead of assembling reports manually before a client review or regulatory inquiry, you maintain a historical record of discovery findings and remediation actions. You can demonstrate your posture over time with consistent, structured evidence.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Security and analytics reinforce each other.&lt;/strong&gt;&amp;nbsp;Your analytics capability is built on top of your security controls, not alongside them. Strengthening one strengthens the other.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;&lt;strong&gt;Cost considerations&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;The primary cost drivers for this architecture include:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon Macie:&lt;/strong&gt;&amp;nbsp;You pay based on the number of S3 buckets evaluated and the volume of data inspected for sensitive data discovery. Review&amp;nbsp;&lt;a href="https://aws.amazon.com/macie/pricing/" target="_blank" rel="noopener noreferrer"&gt;Amazon Macie pricing&lt;/a&gt;&amp;nbsp;for current rates.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon S3:&lt;/strong&gt;&amp;nbsp;Storage costs for both your document repositories and the compliance intelligence bucket. Consider S3 lifecycle policies to tier older findings into lower-cost storage classes.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;AWS Glue and AWS Lake Formation:&lt;/strong&gt;&amp;nbsp;Charges for crawlers and catalog storage. For most implementations, these costs are modest.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon Quick Sight:&lt;/strong&gt;&amp;nbsp;Per-user pricing based on the edition that you select (Standard or Enterprise). Enterprise edition supports row-level and column-level security, which aligns well with matter-based governance.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon EventBridge, AWS Security Hub, and Amazon SNS:&lt;/strong&gt;&amp;nbsp;Charges based on event volume and notifications delivered. For findings-based workflows, these costs are generally low.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Use the&amp;nbsp;&lt;a href="https://calculator.aws/" target="_blank" rel="noopener noreferrer"&gt;AWS Pricing Calculator&lt;/a&gt;&amp;nbsp;to estimate costs based on your repository size, user count, and discovery frequency.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Getting started&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Start by identifying a representative set of document repositories in Amazon S3. We recommend that you start with two or three matters that span different practice areas and confidentiality tiers.&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Turn on Amazon Macie&lt;/strong&gt;&amp;nbsp;for those repositories and configure automated sensitive data discovery.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Catalog the findings dataset&lt;/strong&gt;&amp;nbsp;with AWS Glue and apply Lake Formation tag-based access policies aligned to your matter structure.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Build your first Amazon Quick Sight dashboard&lt;/strong&gt;&amp;nbsp;to visualize findings by matter, sensitivity type, and severity.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Define escalation rules&lt;/strong&gt;&amp;nbsp;in AWS Security Hub or Amazon SNS for high-severity findings.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;After you validate this workflow against your initial repositories, expand gradually. Add more repositories to Macie discovery. Refine your governance tags to reflect practice areas and confidentiality tiers. Extend your dashboards from basic posture visibility to trend analysis and remediation tracking. The goal isn’t to build a comprehensive analytics solution all at once. Start with a secure foundation where discovery findings, governance, and reporting operate together in a way that aligns with your legal workflows, and then expand from there.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;You don’t have to choose between protecting client data and understanding it. By building analytics on top of governed discovery findings and using a unified compliance workspace, you gain visibility into your data posture without weakening confidentiality boundaries. This approach brings security, governance, and analytics together in a way that reflects how legal work is actually structured. It provides continuous visibility, supports audit readiness, and delivers insight without requiring documents to move outside their system of record.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Next steps&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Review the&amp;nbsp;&lt;a href="https://docs.aws.amazon.com/macie/latest/user/what-is-macie.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Macie User Guide&lt;/a&gt;&amp;nbsp;to understand sensitive data discovery configuration options and &lt;a href="https://docs.aws.amazon.com/quicksight/" target="_blank" rel="noopener noreferrer"&gt;Amazon Quick Sight documentation&lt;/a&gt;&amp;nbsp;to evaluate dashboard and row-level security capabilities.&lt;/p&gt; 
&lt;p&gt;Contact your&amp;nbsp;&lt;a href="https://aws.amazon.com/contact-us/" target="_blank" rel="noopener noreferrer"&gt;AWS account team&lt;/a&gt;&amp;nbsp;to discuss implementation support for legal and compliance workloads.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90766" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/30/BDB-5730-image-2-1-e1777588892859.png" alt="Photo of Author - Rohan Kamat" width="100" height="116"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;&lt;strong&gt;Rohan Kamat&lt;/strong&gt;&lt;/h3&gt; 
  &lt;p&gt;Rohan Kamat is a Solutions Architecture Leader within HCLS with extensive experience in cloud architecture, cybersecurity, Identity and Access Management, and enterprise networking. Rohan focuses on helping architects build both depth in cloud technologies and strength in executive communication, making sure they can confidently guide organizations through business and technical transformation. Outside of his professional work, Rohan enjoys time with his family, organizing community cricket events, and exploring fitness and wellness activities like pickleball and ping pong. He also enjoys planning travel experiences that bring people together and create lasting shared memories.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90726" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/30/BDB-5730-image-3.jpeg" alt="Photo of Author- Miguel Lopez Luis" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;&lt;strong&gt;Miguel Lopez Luis&lt;/strong&gt;&lt;/h3&gt; 
  &lt;p&gt;Miguel Lopez Luis is an AWS Solutions Architect who works with small and medium businesses across the United States. He graduated with a Bachelor’s degree in Cybersecurity from Bellevue University in Nebraska and is a member of the Omega Nu Lambda Honor Society. Leveraging his extensive expertise in business management, Miguel is passionate about planning strategic initiatives, leading cross-functional teams, and mentoring others. In his personal time, he enjoys activities that involve travel, sports, and cooking.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90727" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/30/BDB-5730-image-4.jpeg" alt="Photo of Author - Pranali Khose" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;&lt;strong&gt;Pranali Khose&lt;/strong&gt;&lt;/h3&gt; 
  &lt;p&gt;Pranali Khose is an AWS Solutions Architect based in Seattle. She works directly with small and medium business (SMB) customers across the United States to design and implement cloud solutions that address their unique business challenges and accelerate digital transformation. Pranali holds a Master of Science in Computer Science from the University of Texas at Arlington.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Streamlined monitoring and debugging for Amazon EMR on EC2</title>
		<link>https://aws.amazon.com/blogs/big-data/streamlined-monitoring-and-debugging-for-amazon-emr-on-ec2/</link>
					
		
		<dc:creator><![CDATA[Parul Saxena]]></dc:creator>
		<pubDate>Tue, 12 May 2026 15:59:22 +0000</pubDate>
				<category><![CDATA[Amazon EMR]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[AWS Big Data]]></category>
		<category><![CDATA[Best Practices]]></category>
		<category><![CDATA[Monitoring and observability]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">b8dc238b255037fa7d65ca154c626c0e662d22ab</guid>

					<description>In this post, we walk you through five key enhancements: Amazon CloudWatch Logs integration, step-level Amazon Simple Storage Service (Amazon S3) logging controls, expanded console UIs for YARN and Tez, Amazon EMR step to YARN application ID mapping, and enhanced custom metrics with updated documentation.</description>
										<content:encoded>&lt;p&gt;As organizations scale their data processing and analytics workloads on Amazon EMR on EC2, observability across cluster health, job execution, and resource usage becomes increasingly important. Teams often manage log collection across distributed nodes, correlate Amazon EMR steps with underlying YARN applications, and configure monitoring agents to capture the right level of detail for their environment.&lt;/p&gt; 
&lt;p&gt;With Amazon EMR release 7.11.0 and updates to the Amazon EMR console, Amazon EMR on EC2 introduces observability capabilities that streamline these workflows further. In this post, we walk you through five key enhancements: Amazon CloudWatch Logs integration, step-level Amazon Simple Storage Service (Amazon S3) logging controls, expanded console UIs for YARN and Tez, Amazon EMR step to YARN application ID mapping, and enhanced custom metrics with updated documentation.&lt;/p&gt; 
&lt;h2&gt;What’s new&lt;/h2&gt; 
&lt;p&gt;The following sections cover key improvements across the Amazon EMR console, logging, metrics collection, and documentation to give you deeper, end-to-end visibility into your Amazon EMR clusters and workloads.&lt;/p&gt; 
&lt;h3&gt;1. CloudWatch Logs integration&lt;/h3&gt; 
&lt;p&gt;Starting with Amazon EMR release 7.11.0, you can stream cluster logs to Amazon CloudWatch Logs in near real time without requiring custom bootstrap actions or manual agent configuration. With &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-logging-cw.html" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch logging enabled&lt;/a&gt;, Amazon EMR automatically captures and streams Amazon EMR step execution logs, Spark driver logs, and Spark executor logs as they’re generated. This makes them immediately available for monitoring, troubleshooting, and post-mortem analysis through the CloudWatch console or API.&lt;/p&gt; 
&lt;p&gt;You can enable CloudWatch logging through the Amazon EMR console during cluster creation or programmatically using the AWS Command Line Interface (AWS CLI) and SDK by including the Amazon CloudWatch Agent in your application configuration and specifying your logging preferences in the configuration section.&lt;/p&gt; 
&lt;p&gt;With minimal configuration, Amazon EMR captures step logs and Spark driver logs by default, streaming them to a log group named &lt;code&gt;/aws/emr/{cluster_id}&lt;/code&gt;. For production workloads requiring stricter organizational and security controls, you can customize the log group name, define a log stream prefix for streamlined filtering, enable encryption with an AWS Key Management Service (AWS KMS) key, and explicitly select which log types to capture. The following example demonstrates a fully customized configuration:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;aws emr create-cluster \
  --name "EMR cluster with custom CloudWatch Logs" \
  --release-label emr-7.11.0 \
  --applications Name=Spark Name=AmazonCloudWatchAgent \
  --instance-type m7g.2xlarge \
  --instance-count 3 \
  --use-default-roles \
  --monitoring-configuration '{
    "CloudWatchLogConfiguration": {
      "Enabled": true,
      "LogGroupName": "/my-company/emr/production",
      "LogStreamNamePrefix": "cluster-prod",
      "EncryptionKeyArn": "arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012",
      "LogTypes": {
        "STEP_LOGS": ["STDOUT", "STDERR"],
        "SPARK_DRIVER": ["STDOUT", "STDERR"],
        "SPARK_EXECUTOR": ["STDERR", "STDOUT"]
      }
    }
  }'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;This configuration directs the logs to a custom log group (&lt;code&gt;/my-company/emr/production&lt;/code&gt;), prefixes log stream names with &lt;code&gt;cluster-prod&lt;/code&gt; for consistent identification across clusters, encrypts log data at rest using the specified KMS key, and captures the full set of available log types: step stdout/stderr, Spark driver, and Spark executor output. Because logs are streamed to CloudWatch as they’re written, you have near real-time visibility into job execution without waiting for log aggregation to S3 or establishing direct connectivity to cluster nodes. Combined with CloudWatch Logs Insights, you can run structured queries across log streams, making it straightforward to trace failures, correlate errors across driver and executor logs, and build metric filters or alarms based on specific log patterns.&lt;/p&gt; 
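&lt;p&gt;As an illustration of the Logs Insights workflow, the following Python sketch builds a query that surfaces recent error lines from streams carrying the &lt;code&gt;cluster-prod&lt;/code&gt; prefix and starts it with boto3. The error pattern (&lt;code&gt;ERROR|Exception&lt;/code&gt;), the one-hour lookback, and the 50-row limit are assumptions to adjust for your own log format.&lt;/p&gt; 

```python
import time


def build_error_query(stream_prefix: str, limit: int = 50) -> str:
    """Return a Logs Insights query for error lines in matching streams."""
    if '"' in stream_prefix:
        raise ValueError("stream prefix must not contain double quotes")
    return (
        "fields @timestamp, @logStream, @message\n"
        f'| filter @logStream like "{stream_prefix}"\n'
        "| filter @message like /ERROR|Exception/\n"
        "| sort @timestamp desc\n"
        f"| limit {limit}"
    )


def start_error_query(
    log_group: str, stream_prefix: str, lookback_seconds: int = 3600
) -> str:
    """Start the query and return its ID (needs boto3 and credentials)."""
    import boto3  # imported lazily so the builder runs without AWS access

    now = int(time.time())
    response = boto3.client("logs").start_query(
        logGroupName=log_group,
        startTime=now - lookback_seconds,
        endTime=now,
        queryString=build_error_query(stream_prefix),
    )
    return response["queryId"]


# Build the query text for the stream prefix used in the example above.
query = build_error_query("cluster-prod")
```

&lt;p&gt;Logs Insights queries run asynchronously, so pass the returned query ID to &lt;code&gt;get_query_results&lt;/code&gt; and poll until the status is complete.&lt;/p&gt; 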
&lt;h3&gt;2. Step-level S3 logging improvements&lt;/h3&gt; 
&lt;p&gt;S3 logging capabilities now provide &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging-step-log-customization.html" target="_blank" rel="noopener noreferrer"&gt;granular control over how step logs are organized and secured&lt;/a&gt;. You can now specify a dedicated S3 log destination and AWS KMS encryption key at the individual Amazon EMR step level. This allows different steps within the same cluster to write logs to separate S3 paths with independent encryption configurations. This is particularly useful for multi-tenant clusters or workflows with varying data classification requirements.&lt;/p&gt; 
&lt;p&gt;Step-level logging is configured through the &lt;code&gt;StepMonitoringConfiguration&lt;/code&gt; parameter, which accepts an &lt;code&gt;S3MonitoringConfiguration&lt;/code&gt; object where you can define the target S3 path and an AWS KMS key for encryption at rest:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;"StepMonitoringConfiguration": { "S3MonitoringConfiguration": { "LogUri": "s3://your-s3-bucket/", "EncryptionKeyArn": "arn:aws:kms:your-kms-key-arn" } }&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;This configuration is optional. When omitted, the step inherits the &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html" target="_blank" rel="noopener noreferrer"&gt;default S3 log path and encryption settings defined at the cluster level during creation&lt;/a&gt;. With this configuration, you can override logging behavior only for the steps that require it, while maintaining a consistent default for the rest of your workflow.&lt;/p&gt; 
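&lt;p&gt;The override-or-inherit behavior can be sketched in Python as follows. The field names inside &lt;code&gt;StepMonitoringConfiguration&lt;/code&gt; come from the snippet above. The helper &lt;code&gt;build_step&lt;/code&gt;, the step names, the S3 paths, and the placement of the block at the top level of the step definition are assumptions made for illustration, so check the linked documentation for the exact request shape.&lt;/p&gt; 

```python
import json


def build_step(name, jar, args, log_uri=None, kms_key_arn=None):
    """Build an EMR step definition (hypothetical helper).

    The monitoring block is added only when a log URI is supplied, so other
    steps keep the cluster-level logging defaults.
    """
    step = {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": jar, "Args": list(args)},
    }
    if log_uri is not None:
        s3_config = {"LogUri": log_uri}
        if kms_key_arn is not None:
            s3_config["EncryptionKeyArn"] = kms_key_arn
        # Assumption: the block sits at the top level of the step definition.
        step["StepMonitoringConfiguration"] = {
            "S3MonitoringConfiguration": s3_config
        }
    return step


# Inherits the cluster-level log path and encryption settings.
default_step = build_step(
    "nightly-etl",
    "command-runner.jar",
    ["spark-submit", "s3://your-s3-bucket/jobs/etl.py"],
)

# Writes logs to a separate, independently encrypted location.
restricted_step = build_step(
    "pii-report",
    "command-runner.jar",
    ["spark-submit", "s3://your-s3-bucket/jobs/report.py"],
    log_uri="s3://your-s3-bucket/restricted-logs/",
    kms_key_arn="arn:aws:kms:your-kms-key-arn",
)

print(json.dumps(restricted_step, indent=2))
```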
&lt;h3&gt;3. Enhanced console with direct access to monitoring UIs&lt;/h3&gt; 
&lt;p&gt;Additional live application UIs are accessible directly from the Amazon EMR Console. These console-hosted interfaces remove the need to configure SSH (Secure Shell) tunnels, set up proxies, or establish any direct network connectivity to cluster nodes to reach application web UIs. The newly added interfaces include:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;YARN ResourceManager UI –&lt;/strong&gt; Monitor cluster-wide resource allocation, queue usage, and application lifecycle states across running and completed YARN applications. This interface also provides direct access to container-level logs for running YARN applications, enabling real-time debugging without requiring node-level access.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Tez UI –&lt;/strong&gt; Inspect Hive query execution plans, DAG visualizations, vertex-level performance metrics, and task-level counters for queries executed through the Tez execution engine (for example, Hive and Pig workloads).&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;These join the Spark History Server and YARN timeline interfaces already available through the console. By surfacing these UIs in the console, administrators can give developers and analysts visibility into cluster workloads and application diagnostics without exposing direct network access to cluster infrastructure. This keeps security boundaries tight while preserving full observability into job execution and resource consumption.&lt;/p&gt; 
&lt;p&gt;With these additions, Amazon EMR now offers three complementary approaches to accessing application web interfaces, each suited to different operational requirements. Live Application UIs provide console-hosted access to web interfaces on running clusters. They’re recommended for environments where direct network connectivity to cluster nodes must be restricted from end users. &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html" target="_blank" rel="noopener noreferrer"&gt;On-Cluster Web UIs&lt;/a&gt; offer full, unrestricted access to the complete set of native application web interfaces running on cluster nodes, suited for administrators and engineers who require deep, low-level visibility. &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-cluster-application-history.html" target="_blank" rel="noopener noreferrer"&gt;Persistent Web UIs&lt;/a&gt; retain application-level data beyond cluster lifetime, so you can analyze and troubleshoot workloads on terminated clusters. Together, these options give you the flexibility to balance security boundaries, access scope, and data retention based on your team’s specific monitoring and debugging workflows.&lt;/p&gt; 
&lt;h3&gt;4. EMR step to YARN application ID mapping&lt;/h3&gt; 
&lt;p&gt;The Amazon EMR console now surfaces the &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/debug-emr-yarn.html" target="_blank" rel="noopener noreferrer"&gt;YARN Application ID directly within the EMR step&lt;/a&gt; details panel. For each step executing a Spark, Hive, or other YARN-based workload, the console displays the submitted YARN Application ID associated with that step, establishing a direct link between the EMR step abstraction and the underlying YARN application. With this mapping, you can:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Directly correlate EMR steps to YARN applications – &lt;/strong&gt;when a step fails or exhibits unexpected behavior, you can immediately identify the exact YARN application to investigate rather than manually cross-referencing timestamps or job names across interfaces.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Access live monitoring tools –&lt;/strong&gt; with the YARN application ID readily available, you can navigate directly to the YARN ResourceManager Live UI or the Spark History Server to inspect resource consumption, task-level execution details, and application state for both running and completed jobs.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Retrieve logs for detailed troubleshooting – &lt;/strong&gt;the application ID serves as the key lookup for retrieving container-level logs persisted to Amazon S3, significantly reducing the time to root-cause failures or diagnose performance regressions.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;To use this feature, open the &lt;strong&gt;Steps&lt;/strong&gt; tab on your Amazon EMR cluster detail page and select the step that you want to investigate. The YARN Application ID appears in the step details panel. From there, you can use the ID to navigate to the YARN ResourceManager Live UI at &lt;code&gt;http://resourcemanager-host:8088/cluster/app/&amp;lt;application_id&amp;gt;&lt;/code&gt;, open the corresponding view in the Spark History Server, or locate the associated container logs in your configured S3 log destination.&lt;/p&gt; 
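&lt;p&gt;To make the lookup concrete, the following Python sketch turns a YARN Application ID into the ResourceManager URL pattern shown above and into an S3 prefix for container logs. The helper names, the sample IDs, and the bucket are hypothetical. The &lt;code&gt;containers/&lt;/code&gt; prefix layout is an assumption about how your log destination is organized, so verify it against your own bucket.&lt;/p&gt; 

```python
import re

# YARN application IDs look like application_CLUSTERTIMESTAMP_SEQUENCE.
APP_ID_PATTERN = re.compile(r"application_\d+_\d+")


def resource_manager_url(host, application_id, port=8088):
    """Return the ResourceManager UI URL for one application."""
    if not APP_ID_PATTERN.fullmatch(application_id):
        raise ValueError(f"not a YARN application ID: {application_id!r}")
    return f"http://{host}:{port}/cluster/app/{application_id}"


def container_log_prefix(log_uri, cluster_id, application_id):
    """Return the S3 prefix that holds container logs (assumed layout)."""
    if not APP_ID_PATTERN.fullmatch(application_id):
        raise ValueError(f"not a YARN application ID: {application_id!r}")
    return f"{log_uri.rstrip('/')}/{cluster_id}/containers/{application_id}/"


app_id = "application_1700000000000_0001"  # hypothetical ID from the step details panel
url = resource_manager_url("resourcemanager-host", app_id)
prefix = container_log_prefix("s3://your-s3-bucket/logs/", "j-EXAMPLE12345", app_id)
print(url)
print(prefix)
```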
&lt;h3&gt;5. Enhanced custom metrics and observability documentation&lt;/h3&gt; 
&lt;p&gt;By default, Amazon EMR automatically sends cluster-level metrics to &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch at five-minute intervals&lt;/a&gt;, covering YARN application states, node health, HDFS utilization, and I/O activity. With Amazon EMR Release 7.0 and later, enabling the &lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-AmazonCloudWatchAgent.html" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch Agent&lt;/a&gt; extends this baseline with additional detailed metrics collected at one-minute intervals across cluster nodes. Furthermore, Amazon EMR 7.1 introduced custom metric classifications that you can use to define precisely which component-level metrics to collect from Hadoop, YARN, and HBase subsystems, like DataNode I/O activity, NodeManager JVM heap utilization, container resource consumption, and HBase performance counters. Each classification supports configurable export intervals, giving you control over collection granularity based on your monitoring requirements.&lt;/p&gt; 
&lt;p&gt;After they’re enabled, custom metrics are accessible directly from the &lt;strong&gt;Monitoring&lt;/strong&gt; tab in the Amazon EMR console, where you can use a classification filter to switch between the HDFS, YARN, and HBase custom metric groupings that you’ve defined. Metric configurations can also be updated on running clusters through the console’s reconfiguration workflow, so you can adapt your monitoring strategy as workload requirements evolve without cluster downtime. For environments using Prometheus, metrics can also be forwarded to Amazon Managed Service for Prometheus and visualized through Grafana dashboards.&lt;/p&gt; 
&lt;p&gt;The following documentation and tutorials are available to help you get the most out of these capabilities:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/enhanced-custom-metrics.html" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;Enhanced Custom Metrics Guide&lt;/strong&gt;&lt;/a&gt; provides step-by-step instructions for configuring CloudWatch Agent to publish custom metrics.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-metrics-observability.html" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;EMR Observability Best Practices&lt;/strong&gt;&lt;/a&gt; provides a comprehensive guide covering monitoring strategies, metric selection, and troubleshooting workflows.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/enhanced-custom-metrics-application-status.html" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;Service Status Monitoring&lt;/strong&gt;&lt;/a&gt; provides a tutorial on monitoring and publishing Amazon EMR application status.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/enhanced-custom-metrics-applications.html" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;Monitor Apache Spark applications on Amazon EMR with Amazon CloudWatch&lt;/strong&gt;&lt;/a&gt; provides a tutorial on publishing detailed Spark metrics to CloudWatch and identifying performance bottlenecks in Spark applications.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Getting started&lt;/h2&gt; 
&lt;p&gt;These observability improvements are available now for Amazon EMR on EC2. To get started:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;For CloudWatch Logs integration and step-level log configuration&lt;/strong&gt;: Launch a new cluster with Amazon EMR release 7.11.0 or later.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;For console enhancements&lt;/strong&gt;: Navigate to your existing Amazon EMR clusters in the AWS Console to access Live Application UI links and YARN Application ID mappings in step details, with no additional configuration required.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;For custom metrics&lt;/strong&gt;: Review our &lt;a href="http://docs.aws.amazon.com/emr/latest/ManagementGuide/enhanced-custom-metrics.html" target="_blank" rel="noopener noreferrer"&gt;Enhanced Custom Metrics documentation&lt;/a&gt; to configure the CloudWatch Agent for publishing Hadoop, YARN, and HBase component metrics using custom classification files.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;With these enhancements, Amazon EMR on EC2 provides deeper visibility into cluster health, job execution, and resource usage, helping you reduce time to root cause and focus on delivering value from your data. Note that enabling CloudWatch Logs integration and custom metrics incurs additional CloudWatch charges based on log ingestion volume and metric publishing frequency.&lt;/p&gt; 
&lt;p&gt;If you have feedback or questions, reach out to your AWS account team or post on &lt;a href="https://repost.aws/" target="_blank" rel="noopener noreferrer"&gt;AWS re:Post&lt;/a&gt;.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-91037 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/06/parul.jpg" alt="" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Parul Saxena&lt;/strong&gt;&lt;br&gt; &lt;a href="https://www.linkedin.com/in/parulsaxena27/" target="_blank" rel="noopener noreferrer"&gt;Parul&lt;/a&gt; is a Senior Big Data Specialist Solutions Architect at Amazon Web Services (AWS). She helps customers and partners build highly optimized, scalable, and secure solutions. She specializes in Amazon EMR, Amazon Athena, and AWS Lake Formation, providing architectural guidance for complex big data workloads and assisting organizations in modernizing their architectures and migrating analytics workloads to AWS.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-72477 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2024/11/19/ravi-kumar.png" alt="" width="100" height="130"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Ravi Kumar Singh&lt;/strong&gt;&lt;br&gt; &lt;a href="https://www.linkedin.com/in/ravikumarsingh19/" target="_blank" rel="noopener noreferrer"&gt;Ravi Kumar Singh&lt;/a&gt; is a Senior Product Manager Technical-ES (PMT) at Amazon Web Services, specializing in exabyte-scale data infrastructure and analytics platforms. He helps customers unlock insights from their data using open-source technologies and cloud computing for AI/ML use cases. Outside of work, Ravi enjoys exploring emerging trends in data science and machine learning.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-33305 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/08/16/lorenzo-ripani.png" alt="" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Lorenzo Ripani&lt;/strong&gt;&lt;br&gt; &lt;a href="https://www.linkedin.com/in/ripani" target="_blank" rel="noopener noreferrer"&gt;Lorenzo Ripani&lt;/a&gt; is a Big Data Solution Architect at AWS. He is passionate about distributed systems, open-source technologies, and security. He spends most of his time working with customers around the world to design, evaluate and optimize scalable and secure data pipelines with Amazon EMR.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-91038 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/06/Arun.jpg" alt="" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Arun Prabakaran&lt;/strong&gt;&lt;br&gt; &lt;a href="https://www.linkedin.com/in/arprab/" target="_blank" rel="noopener noreferrer"&gt;Arun Prabakaran&lt;/a&gt; is a Senior Software Engineer working at AWS. His expertise spans distributed data processing and large-scale systems. He is passionate about building reliable data platforms and enabling organizations to run analytics and AI workloads at scale.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-91039 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/06/Jason.jpg" alt="" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Jason Zou&lt;/strong&gt;&lt;br&gt; &lt;a href="https://www.linkedin.com/in/jasonpzou/" target="_blank" rel="noopener noreferrer"&gt;Jason Zou&lt;/a&gt; is a Software Development Engineer at Amazon Web Services, where he works on internal infrastructure supporting EMR clusters. He is passionate about building scalable, fault-tolerant distributed systems. Outside of work, he enjoys photography and playing basketball.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-91040 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/06/Justin.jpg" alt="" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Justin Mae&lt;/strong&gt;&lt;br&gt; Justin Mae is a Software Development Engineer on the Amazon EMR team at Amazon Web Services. He works on EMR on EC2’s control plane, building systems that improve cluster performance, observability, and operational reliability.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Detect and resolve HBase inconsistencies faster with AI on Amazon EMR</title>
		<link>https://aws.amazon.com/blogs/big-data/detect-and-resolve-hbase-inconsistencies-faster-with-ai-on-amazon-emr/</link>
					
		
		<dc:creator><![CDATA[Yu-Ting Su]]></dc:creator>
		<pubDate>Tue, 12 May 2026 15:56:41 +0000</pubDate>
				<category><![CDATA[Amazon EMR]]></category>
		<category><![CDATA[Amazon OpenSearch Service]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Customer Solutions]]></category>
		<category><![CDATA[Experience-Based Acceleration]]></category>
		<category><![CDATA[Kiro]]></category>
		<category><![CDATA[Amazon OpenSearch]]></category>
		<category><![CDATA[Apache HBase]]></category>
		<category><![CDATA[EMR]]></category>
		<guid isPermaLink="false">db9396e58fef011dcd89131f3ab418a1d8dd68f3</guid>

					<description>In this post, we show you how to build an AI-powered troubleshooting solution using Amazon OpenSearch Service vector search and intelligent analysis. This solution reduces HBase inconsistency resolution from hours to minutes and root cause identification from days to hours through natural language queries over operational data. It also democratizes HBase troubleshooting capabilities across teams and reduces dependency on specialized expertise.</description>
										<content:encoded>&lt;p&gt;&lt;a href="https://hbase.apache.org/book.html" target="_blank" rel="noopener"&gt;HBase&lt;/a&gt; operations teams spend hours manually correlating logs, metadata, and consistency reports to identify root causes. Traditional approaches require deep expertise and extensive investigation across scattered data sources, directly impacting MTTR and operational efficiency. As HBase deployments scale and expertise becomes increasingly scarce, organizations face mounting pressure to maintain service reliability while managing growing operational complexity. The manual nature of troubleshooting creates bottlenecks that delay incident resolution, increase operational costs, and risk service degradation during critical business periods.&lt;/p&gt; 
&lt;p&gt;In this post, we show you how to build an AI-powered troubleshooting solution using &lt;a href="https://aws.amazon.com/opensearch-service/" target="_blank" rel="noopener"&gt;Amazon OpenSearch Service&lt;/a&gt; vector search and intelligent analysis. This solution reduces HBase inconsistency resolution from hours to minutes and root cause identification from days to hours through natural language queries over operational data. It also democratizes HBase troubleshooting capabilities across teams and reduces dependency on specialized expertise.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;The solution addresses HBase troubleshooting challenges through data processing, vector search, and AI-powered analysis. It processes operational data from &lt;a href="https://aws.amazon.com/emr/" target="_blank" rel="noopener noreferrer"&gt;Amazon EMR&lt;/a&gt; clusters, generates semantic vector embeddings, and enables natural language queries for intelligent troubleshooting.&lt;br&gt; Key components include:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon EMR HBase:&lt;/strong&gt; Runs HBase workloads with &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3&lt;/a&gt; as the HBase rootdir for durable, scalable storage&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Data Processing&lt;/strong&gt;: Extracts and processes HBase logs, &lt;a href="https://github.com/apache/hbase-operator-tools/blob/master/hbase-hbck2/README.md" target="_blank" rel="noopener noreferrer"&gt;HBCK&lt;/a&gt; reports, and metadata with vector embeddings&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon OpenSearch Service&lt;/strong&gt;: Provides vector search capabilities with k-NN algorithms for semantic analysis&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;AI Analysis Interface&lt;/strong&gt;: Enables natural language queries with context-aware recommendations&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Custom Knowledge Base&lt;/strong&gt;: Supports organization-specific runbooks and troubleshooting procedures by ingesting Git repositories via &lt;a href="https://kiro.dev/" target="_blank" rel="noopener noreferrer"&gt;Kiro CLI&lt;/a&gt;’s &lt;code&gt;/knowledge add&lt;/code&gt; command, enabling the AI assistant to reference custom operational guides alongside HBase source code and operational tools&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55491.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90179" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55491.png" alt="AWS cloud architecture diagram showing an HBase log analysis system with EMR cluster, VPC networking, IAM roles, Lambda functions, OpenSearch domain, and supporting services for scalable log processing and analytics." width="1000" height="573"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;The preceding diagram illustrates how the HBase log analysis system troubleshoots inconsistencies through automated workflows across AWS services.&lt;/p&gt; 
&lt;p&gt;When an operations team needs to investigate HBase issues, the engineer connects over SSH to the Amazon EMR primary node and runs the error collection script, which gathers logs from HBase master and RegionServer nodes and uploads them to Amazon S3. Next, the engineer connects to the Analytics &lt;a href="https://aws.amazon.com/ec2/" target="_blank" rel="noopener"&gt;Amazon Elastic Compute Cloud (Amazon EC2)&lt;/a&gt; instance and executes the automated processing script, which downloads logs from Amazon S3, generates semantic vector embeddings, and ingests them into Amazon OpenSearch Service for k-NN-based semantic search. The engineer then queries the Kiro CLI AI Assistant using natural language to investigate. Kiro searches Amazon OpenSearch Service for relevant log entries and uses &lt;a href="https://aws.amazon.com/bedrock/" target="_blank" rel="noopener"&gt;Amazon Bedrock&lt;/a&gt; to analyze patterns, correlate errors across components, and provide actionable recommendations. This reduces troubleshooting time from hours to minutes. The system operates within an &lt;a href="https://aws.amazon.com/vpc/" target="_blank" rel="noopener"&gt;Amazon Virtual Private Cloud (Amazon VPC)&lt;/a&gt; with private subnets for Amazon EMR and Analytics &lt;a href="https://aws.amazon.com/ec2/" target="_blank" rel="noopener"&gt;Amazon EC2&lt;/a&gt;, &lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener"&gt;AWS Identity and Access Management (AWS IAM)&lt;/a&gt; roles for access control, Parameter Store for configuration, and &lt;a href="https://aws.amazon.com/cloudwatch/" target="_blank" rel="noopener"&gt;Amazon CloudWatch&lt;/a&gt; for monitoring.&lt;/p&gt; 
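&lt;p&gt;To give a sense of what a k-NN semantic search request looks like, the following Python sketch builds a query body in the OpenSearch k-NN query format. The field name &lt;code&gt;embedding&lt;/code&gt;, the helper &lt;code&gt;build_knn_query&lt;/code&gt;, and the three-dimensional sample vector are hypothetical. The solution’s actual index mapping and embedding size are defined in its repository and may differ.&lt;/p&gt; 

```python
import json


def build_knn_query(query_vector, k=5, vector_field="embedding"):
    """Return an OpenSearch k-NN search body for one query embedding.

    Hypothetical helper: the vector field name is an assumption.
    """
    return {
        "size": k,
        "query": {
            "knn": {
                vector_field: {
                    "vector": list(query_vector),
                    "k": k,  # number of nearest neighbors to retrieve
                }
            }
        },
    }


# A real embedding has hundreds of dimensions; three keep the example readable.
body = build_knn_query([0.12, -0.48, 0.33], k=3)
print(json.dumps(body, indent=2))
```

&lt;p&gt;The query vector is the embedding of the natural language question, so log entries whose embeddings lie closest to it are returned first.&lt;/p&gt; 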
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;For this walkthrough, you need the following prerequisites:&lt;/p&gt; 
&lt;h3&gt;AWS account setup&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;An &lt;a href="https://signin.aws.amazon.com/signin?redirect_uri=https%3A%2F%2Fportal.aws.amazon.com%2Fbilling%2Fsignup%2Fresume&amp;amp;client_id=signup" target="_blank" rel="noopener noreferrer"&gt;AWS account&lt;/a&gt; with administrative access for initial deployment&lt;/li&gt; 
 &lt;li&gt;AWS Command Line Interface (AWS CLI) configured with administrative credentials&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Required AWS IAM permissions&lt;/h3&gt; 
&lt;h4&gt;For infrastructure deployment&lt;/h4&gt; 
&lt;p&gt;Your deployment user or role needs the following permissions:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Sufficient access to &lt;a href="https://aws.amazon.com/cloudformation/" target="_blank" rel="noopener noreferrer"&gt;AWS CloudFormation&lt;/a&gt;, Amazon S3, AWS IAM, and &lt;a href="https://aws.amazon.com/systems-manager/" target="_blank" rel="noopener noreferrer"&gt;AWS Systems Manager&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;The ability to create AWS CloudFormation stacks&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Infrastructure deployment:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;AWS CloudFormation stack management permissions&lt;/li&gt; 
 &lt;li&gt;Sufficient access to create and manage the following resources: 
  &lt;ul&gt; 
   &lt;li&gt;Amazon OpenSearch Service domains&lt;/li&gt; 
   &lt;li&gt;Amazon EC2 instances, Amazon VPCs, security groups, and networking components&lt;/li&gt; 
   &lt;li&gt;AWS IAM roles and policies&lt;/li&gt; 
   &lt;li&gt;AWS Systems Manager Parameter Store entries&lt;/li&gt; 
   &lt;li&gt;Amazon CloudWatch Logs groups&lt;/li&gt; 
   &lt;li&gt;Amazon S3 bucket for access logs and session logs&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ul&gt; 
&lt;h4&gt;Runtime service roles&lt;/h4&gt; 
&lt;p&gt;The AWS CloudFormation stack automatically creates two specialized AWS IAM roles designed with least-privilege access principles.&lt;/p&gt; 
&lt;p&gt;The first role is the Amazon OpenSearch Service Role, which manages Amazon VPC networking and Amazon CloudWatch logging for the Amazon OpenSearch Service domain.&lt;/p&gt; 
&lt;p&gt;The second role is the Application Role, which provides minimal Amazon OpenSearch Service and Amazon S3 access specifically for log processing applications and secure log ingestion operations.&lt;/p&gt; 
&lt;h3&gt;Network requirements&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;Amazon VPC with private subnets for secure Amazon OpenSearch Service deployment&lt;/li&gt; 
 &lt;li&gt;NAT Gateway for outbound internet access from private subnets&lt;/li&gt; 
 &lt;li&gt;Security groups configured for HTTPS-only communication&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Running Kiro CLI on Amazon EC2&lt;/h3&gt; 
&lt;h4&gt;Kiro platform requirements:&lt;/h4&gt; 
&lt;p&gt;&lt;strong&gt;Kiro subscription&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Active Kiro License&lt;/strong&gt;: Valid subscription to Kiro platform&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;User Account&lt;/strong&gt;: Registered Kiro user account with appropriate permissions&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;API Access&lt;/strong&gt;: Kiro API keys or authentication tokens for CLI access&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;AWS IAM Identity Center integration&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/singlesignon/latest/userguide/what-is.html" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;AWS IAM Identity Center&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt; Setup&lt;/strong&gt;: AWS IAM Identity Center enabled in your &lt;a href="https://docs.aws.amazon.com/organizations/latest/userguide/orgs_introduction.html" target="_blank" rel="noopener noreferrer"&gt;AWS organization&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Permission Sets&lt;/strong&gt;: Configured permission sets for Kiro users with appropriate AWS access&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;User Assignment&lt;/strong&gt;: Users assigned to relevant AWS accounts and permission sets&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;SAML/OIDC Configuration&lt;/strong&gt;: Identity provider integration if using external identity systems&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Additional prerequisites&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;Python 3.7+ and Node.js installed locally&lt;/li&gt; 
 &lt;li&gt;Python 3.11+ for &lt;a href="https://aws.amazon.com/lambda/" target="_blank" rel="noopener noreferrer"&gt;AWS Lambda&lt;/a&gt; runtime environment (required for OpenSearch MCP server compatibility)&lt;/li&gt; 
 &lt;li&gt;Sufficient service quotas for Amazon OpenSearch Service instances and Amazon EC2 resources&lt;/li&gt; 
 &lt;li&gt;Access to the analysis instance via AWS Systems Manager Session Manager (recommended)&lt;/li&gt; 
 &lt;li&gt;Amazon EMR clusters running HBase workloads&lt;/li&gt; 
 &lt;li&gt;Permission for the Amazon EMR EC2 instance profile role (&lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-role-for-ec2.html" target="_blank" rel="noopener noreferrer"&gt;EMR_EC2_Default_Role&lt;/a&gt;) to run &lt;code&gt;describe-stacks&lt;/code&gt; on AWS CloudFormation stacks in &lt;code&gt;us-east-1&lt;/code&gt;&lt;/li&gt; 
 &lt;li&gt;Basic familiarity with HBase operations&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The deployment follows AWS security best practices with resource-specific permissions, regional restrictions, and encrypted data storage. All AWS IAM policies implement least-privilege access patterns to help secure operation of the log analysis pipeline.&lt;/p&gt; 
&lt;h2&gt;Walkthrough&lt;/h2&gt; 
&lt;p&gt;This walkthrough demonstrates deploying and configuring the AI-powered HBase troubleshooting solution in five key steps:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Deploy AWS infrastructure using AWS CloudFormation&lt;/li&gt; 
 &lt;li&gt;Configure Amazon EMR analysis log collection&lt;/li&gt; 
 &lt;li&gt;Process and index HBase data&lt;/li&gt; 
 &lt;li&gt;Enable AI-powered analysis&lt;/li&gt; 
 &lt;li&gt;Add custom knowledge base (optional)&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;The complete solution is available in our &lt;a href="https://github.com/aws-samples/sample-emr-hbase-inconsistencies-detection-recovery-mcp-kiro/tree/main" target="_blank" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt; 
&lt;h3&gt;Step 1: Deploy the infrastructure&lt;/h3&gt; 
&lt;p&gt;Deploy the required AWS infrastructure, including the Amazon OpenSearch Service domain, Amazon EC2 instances, and AWS IAM roles.&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;To deploy the infrastructure&lt;/em&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Deploy the AWS CloudFormation stack. Replace &lt;code&gt;your-email@example.com&lt;/code&gt; with the email address that should receive security alerts and Advanced Intrusion Detection Environment (AIDE) reports:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Deploy to development environment
aws cloudformation create-stack \
  --stack-name dev-hbase-log-analysis \
  --template-body file://cloudformation/hbase-log-analysis-simple.yaml \
  --parameters \
    ParameterKey=EnvironmentName,ParameterValue=dev \
    ParameterKey=EC2InstanceType,ParameterValue=m7g.xlarge \
    ParameterKey=SecurityAlertEmail,ParameterValue=your-email@example.com \
  --capabilities CAPABILITY_IAM \
  --region us-east-1
# Wait for deployment to complete (~15-20 minutes)
aws cloudformation wait stack-create-complete \
  --stack-name dev-hbase-log-analysis \
  --region us-east-1&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;Note the deployment outputs, including the Amazon OpenSearch Service endpoint and Amazon EC2 instance details, in the AWS CloudFormation console.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55492.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90183" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55492.png" alt="AWS CloudFormation stack outputs table displaying infrastructure resource identifiers including IAM roles, EC2 instances, security groups, S3 buckets, OpenSearch domain configuration, and VPC details for an HBase log analysis application in the development environment." width="943" height="497"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;The deployment creates:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Amazon OpenSearch Service domain with vector search capabilities&lt;/li&gt; 
 &lt;li&gt;Amazon EC2 instance for data processing and AI analysis&lt;/li&gt; 
 &lt;li&gt;AWS IAM roles with appropriate permissions&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/vpc-security-groups.html" target="_blank" rel="noopener noreferrer"&gt;Security groups&lt;/a&gt; and Amazon VPC configuration&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Step 2: Connect to Amazon EC2 instance and set up system&lt;/h3&gt; 
&lt;p&gt;Connect to the Amazon EC2 instance using AWS Systems Manager (SSM) and set up the required components.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;To connect and set up the system&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Run the following commands to get the instance ID from AWS CloudFormation outputs and connect via AWS Systems Manager (SSM):&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Get instance ID
INSTANCE_ID=$(aws cloudformation describe-stacks \
  --stack-name dev-hbase-log-analysis \
  --query 'Stacks[0].Outputs[?OutputKey==`EC2InstanceId`].OutputValue' \
  --output text \
  --region us-east-1)
# Connect via SSM
aws ssm start-session --target $INSTANCE_ID --region us-east-1&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55493.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90185" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55493.png" alt="Terminal screenshot showing AWS CLI commands to retrieve an EC2 instance ID from CloudFormation stack outputs and establish an AWS Systems Manager Session Manager connection to the instance in the us-east-1 region." width="1000" height="287"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;Clone the repository and run automated setup:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;# On EC2 instance
sudo su - ec2-user

# Re-install aws cli
sudo dnf remove awscli -y

# For ARM64 (Graviton instances - default)
curl "https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip" -o "awscliv2.zip"

# For x86_64 (if using non-Graviton instances)
# curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"

unzip awscliv2.zip
sudo ./aws/install

# update $PATH in ~/.bashrc
echo 'export PATH=$PATH:/usr/local/bin/' &amp;gt;&amp;gt; ~/.bashrc

# Reload ~/.bashrc
source ~/.bashrc

# Fork and clone the source code repository on GitHub: sample-emr-hbase-inconsistencies-detection-recovery-mcp-kiro
git clone https://github.com/YOUR_USERNAME/sample-emr-hbase-inconsistencies-detection-recovery-mcp-kiro.git hbase-analysis
cd hbase-analysis

# Run automated setup
chmod +x ./scripts/setup/automated-system-setup.sh
./scripts/setup/automated-system-setup.sh \
  --emr-version emr-7.12.0 \
  --stack-name dev-hbase-log-analysis \
  --region us-east-1&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The automated setup script installs:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;System dependencies (&lt;a href="https://aws.amazon.com/cli/" target="_blank" rel="noopener noreferrer"&gt;awscli&lt;/a&gt;, git, unzip)&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://github.com/astral-sh/uv" target="_blank" rel="noopener noreferrer"&gt;uv package manager&lt;/a&gt; and &lt;a href="https://github.com/opensearch-project/opensearch-mcp-server-py" target="_blank" rel="noopener noreferrer"&gt;OpenSearch MCP Server&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;Kiro CLI and &lt;a href="https://kiro.dev/docs/getting-started/authentication/" target="_blank" rel="noopener noreferrer"&gt;configuration&lt;/a&gt; with AWS IAM Identity Center authentication. The script automatically adds the Apache HBase open source repository and the Apache HBase open source operational tools to the Kiro CLI knowledge bases&lt;/li&gt; 
 &lt;li&gt;HBase source repositories for your Amazon EMR version&lt;/li&gt; 
 &lt;li&gt;Python dependencies and MCP server configuration&lt;/li&gt; 
&lt;/ul&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;Add your own knowledge base to Kiro CLI&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;To extend Kiro CLI’s analysis beyond the Apache HBase open source repositories with your organization’s HBase runbooks and troubleshooting guides, you can add your own knowledge base repositories by using the following commands. Periodically validate and maintain your runbook contents so that they remain accurate and up to date, reflecting any changes in your HBase environment, configurations, or operational procedures.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-php"&gt;# Navigate to the HBase repositories directory
cd /opt/hbase-repositories
# Clone your organization's HBase runbook repository
git clone &amp;lt;runbook-repository-url&amp;gt; &amp;lt;your-own-runbook-repo&amp;gt;
# Example:
# git clone https://github.com/your-org/hbase-runbooks.git hbase-runbooks
# git clone https://gitlab.company.com/ops/hbase-troubleshooting.git hbase-troubleshooting
# Add your custom repositories to the Kiro CLI knowledge base manually (the following pipes the /knowledge add command into kiro-cli):
echo "/knowledge add --name \"Your custom HBase knowledge base\" --path /opt/hbase-repositories/&amp;lt;your-own-runbook-repo&amp;gt;" | kiro-cli
# Example:
# echo "/knowledge add --name \"Company HBase runbooks\" --path /opt/hbase-repositories/hbase-runbooks" | kiro-cli
# echo "/knowledge add --name \"HBase troubleshooting guides\" --path /opt/hbase-repositories/hbase-troubleshooting" | kiro-cli&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3&gt;Step 3: Configure Amazon EMR log analysis collection&lt;/h3&gt; 
&lt;p&gt;Set up data collection from your Amazon EMR clusters to gather HBase logs, metadata, and consistency reports using the recommended direct collection method.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;To configure Amazon EMR log analysis collection&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;On your Amazon EMR cluster primary node, run the following commands to download the collection scripts:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;# On EMR primary node
sudo su - hadoop

# Fork and clone the source code repository on GitHub: sample-emr-hbase-inconsistencies-detection-recovery-mcp-kiro
git clone https://github.com/YOUR_USERNAME/sample-emr-hbase-inconsistencies-detection-recovery-mcp-kiro.git hbase-analysis
cd hbase-analysis&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;Run the interactive collection wizard:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;# Run collection wizard
python3 scripts/utilities/emr_log_collection/emr_cluster_wizard_v2.py
&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Enter parameters such as the EMR cluster’s jobflow ID, the name of the log analysis Amazon S3 bucket, and the lookback hours. The default lookback is 4 hours.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55494.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90192" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55494.png" alt="Terminal screenshot of EMR Cluster Log Collection Wizard V2 showing an interactive command-line interface for configuring HBase diagnostic log collection from Amazon EMR clusters, with step indicators, input fields for job flow ID and S3 bucket, validation confirmations, and lookback hour configuration." width="1000" height="1017"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;The collection wizard performs these actions:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;ul&gt; 
 &lt;li&gt;Collects HBase logs from the local filesystem. See the prerequisites for the required access permissions.&lt;/li&gt; 
 &lt;li&gt;Runs &lt;code&gt;sudo -u hbase hbase hbck -details&lt;/code&gt; (or hbck2 for HBase 2.x)&lt;/li&gt; 
 &lt;li&gt;Runs &lt;code&gt;hdfs dfs -ls -R /hbase&lt;/code&gt; or &lt;code&gt;aws s3 ls &amp;lt;hbase-root-dir&amp;gt; --recursive&lt;/code&gt;&lt;/li&gt; 
 &lt;li&gt;Runs &lt;code&gt;hbase shell &amp;lt;&amp;lt;&amp;lt; 'scan "hbase:meta"'&lt;/code&gt;&lt;/li&gt; 
 &lt;li&gt;Creates properly named files matching analysis system requirements&lt;/li&gt; 
 &lt;li&gt;Uploads to Amazon S3 with correct naming conventions&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Here’s the data collection summary:&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55495.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90193" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55495.png" alt="Terminal screenshot showing EMR Cluster Log Collection Wizard V2 completion summary with job flow ID, S3 bucket location, 4-hour lookback period, green success confirmation message, S3 file path, and detailed listing of seven collected diagnostic files including HBCK reports, HBase meta table scans, root directory paths, process information, log collection summary, node logs from all servers, and collection metadata in JSON format." width="1000" height="567"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;You can check the uploaded contents through the &lt;a href="https://aws.amazon.com/cli/" target="_blank" rel="noopener noreferrer"&gt;AWS CLI&lt;/a&gt;.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;aws s3 ls s3://&amp;lt;log-path&amp;gt; --recursive&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Here’s a screenshot of the outputs.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55496.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90194" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55496.png" alt="Terminal screenshot showing AWS CLI command output listing HBase diagnostic files and logs collected from an EMR cluster and stored in Amazon S3, displaying timestamps, file sizes, and complete S3 object paths including diagnostics directory with HBCK reports, meta table scans, root directory listings, process information, and logs directory with compressed application logs from HBase master and regionserver nodes." width="1000" height="112"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;ol start="4"&gt; 
 &lt;li&gt;On the analysis Amazon EC2 instance, download the collected files from Amazon S3.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;# On analytics EC2 instance
sudo su - ec2-user

# Download logs from S3
mkdir -p /tmp/hbase-log-analysis
cd /tmp/hbase-log-analysis
aws s3 sync s3://&amp;lt;S3-BUCKET-NAME&amp;gt;/emr-logs/&amp;lt;EMR-JOBFLOW-ID&amp;gt;/ .
&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;You can get your jobflow ID from the Amazon EMR console:&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55497.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90196" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55497.png" alt="Amazon EMR clusters management dashboard displaying a table with clusters, showing one cluster entry named &amp;quot;test&amp;quot; in waiting status with green indicator, creation time, elapsed time, normalized instances, along with filter controls, search functionality, pagination showing page 1, and action buttons for View details, Terminate, Clone, and Create cluster operations." width="1000" height="109"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;The downloaded files (&lt;code&gt;hbase-hbase-master-ip-xxx-xxx-xxx-xxx.ec2.internal.log.gz&lt;/code&gt;, &lt;code&gt;hbase-hbase-regionserver-ip-xxx-xxx-xxx-xxx.ec2.internal.log.gz&lt;/code&gt;, &lt;code&gt;hbck_report.txt&lt;/code&gt;, &lt;code&gt;hbase_rootdir_paths.txt&lt;/code&gt;, &lt;code&gt;hbase_meta.txt&lt;/code&gt;, &lt;code&gt;hbase_processes.txt&lt;/code&gt;, &lt;code&gt;log_copy_summary.txt&lt;/code&gt;) should match the names and layout that the automated processing script expects, as shown in the following screenshot.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55498.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90197" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55498.png" alt="Terminal screenshot showing recursive ls -lRt command output listing HBase diagnostic files and logs in /tmp/hbase-log-analysis/ directory, displaying file permissions, ownership by ec2-user, file sizes, timestamps, and complete directory structure including diagnostics directory with text files (manifest.json, HBCK report, meta table scan, process information, root directory paths, log copy summary), logs directory with nested nodes subdirectory containing redacted instance IDs, and applications/hbase subdirectories with compressed RegionServer and Master log files." width="1000" height="977"&gt;&lt;/a&gt;&lt;/p&gt; 
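&lt;p&gt;Before you run the processing step, you might want a quick check that the download is complete. The following is a rough sketch that looks for the diagnostic file names listed above anywhere under a directory and counts the compressed HBase logs. The helper names &lt;code&gt;missing_diagnostics&lt;/code&gt; and &lt;code&gt;count_hbase_logs&lt;/code&gt; are hypothetical, and the check is an assumption about what is useful to verify; the actual requirements are defined by the processing script in the repository.&lt;/p&gt;

```python
import tempfile
from pathlib import Path

# Diagnostic file names taken from the list in this post.
EXPECTED_DIAGNOSTICS = [
    "hbck_report.txt",
    "hbase_rootdir_paths.txt",
    "hbase_meta.txt",
    "hbase_processes.txt",
    "log_copy_summary.txt",
]


def missing_diagnostics(root):
    """Return expected diagnostic file names not found anywhere under root."""
    present = {path.name for path in Path(root).rglob("*") if path.is_file()}
    return [name for name in EXPECTED_DIAGNOSTICS if name not in present]


def count_hbase_logs(root):
    """Count compressed HBase master and RegionServer logs under root."""
    return sum(1 for _ in Path(root).rglob("hbase-hbase-*.log.gz"))


# Demonstration against a throwaway directory holding one diagnostic file.
with tempfile.TemporaryDirectory() as demo_root:
    diagnostics = Path(demo_root) / "diagnostics"
    diagnostics.mkdir()
    (diagnostics / "hbck_report.txt").write_text("demo")
    print("missing:", missing_diagnostics(demo_root))
    print("log files:", count_hbase_logs(demo_root))
```

&lt;p&gt;To check a real download, pass &lt;code&gt;/tmp/hbase-log-analysis&lt;/code&gt; as the root directory.&lt;/p&gt;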
&lt;h3&gt;Step 4: Process and index data&lt;/h3&gt; 
&lt;p&gt;Process the collected HBase data and create vector embeddings for intelligent search capabilities. To process and index the data, navigate to the project directory on the analysis EC2 instance and run &lt;code&gt;automated-log-processing.sh&lt;/code&gt;:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-shell"&gt;sudo su - ec2-user
cd ~/hbase-analysis
chmod +x ./scripts/processing/automated-log-processing.sh
./scripts/processing/automated-log-processing.sh \
  --job-flow-id j-YOUR-JOB-FLOW-ID \
  --stack-name dev-hbase-log-analysis&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The processing scripts extract and parse HBase logs and generate vector embeddings from HBase log messages using sentence transformer models, which enables semantic search beyond keyword matching. The system uses the &lt;a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2" target="_blank" rel="noopener noreferrer"&gt;all-MiniLM-L6-v2&lt;/a&gt; model by default (producing 384-dimensional embeddings) but supports configurable models with different embedding dimensions, automatically adapting the &lt;a href="https://docs.opensearch.org/latest/vector-search/creating-vector-index/" target="_blank" rel="noopener noreferrer"&gt;OpenSearch vector index&lt;/a&gt; to match the chosen model’s output.&lt;/p&gt; 
&lt;p&gt;The system processes HBase operational data including region operations, compaction activities, Write-Ahead Log events, memstore operations, and cluster management information from HMaster and RegionServer logs. &lt;a href="https://docs.opensearch.org/latest/vector-search/" target="_blank" rel="noopener noreferrer"&gt;Vector embeddings&lt;/a&gt; capture error messages, exception stack traces, performance warnings, and multi-line log entries through text preprocessing. With this semantic representation, you can query conceptually for “region server performance issues” or “memory pressure” and receive contextually relevant results across different log files and time periods. Vector search supports error correlation by grouping similar exceptions, performance analysis by identifying related bottlenecks, and operational pattern recognition.&lt;/p&gt; 
&lt;p&gt;Each log entry is stored in Amazon OpenSearch Service with its original metadata (timestamp, log level, source file, job flow ID) alongside the embedding vector, enabling both structured queries and AI-powered semantic analysis. This approach transforms raw HBase logs into a searchable knowledge base supporting anomaly detection, trend analysis, and predictive insights for proactive cluster management and troubleshooting.&lt;/p&gt; 
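&lt;p&gt;To make the preprocessing step more concrete, the following is a simplified sketch of how multi-line log entries such as stack traces can be grouped and tagged with metadata before embedding. It is not the repository’s implementation: the log4j-style line layout, the function name &lt;code&gt;parse_entries&lt;/code&gt;, the dictionary keys, and the sample lines are all assumptions made for illustration.&lt;/p&gt;

```python
import re

# Assumed layout: "YYYY-MM-DD HH:MM:SS,mmm LEVEL rest-of-message".
LINE_START = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+([A-Z]+)\s+(.*)$"
)


def parse_entries(lines, source_file, job_flow_id):
    """Group raw lines into entries; continuation lines join the previous entry."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = LINE_START.match(line)
        if match:
            timestamp, level, message = match.groups()
            entries.append({
                "timestamp": timestamp,
                "log_level": level,
                "message": message,
                "source_file": source_file,
                "job_flow_id": job_flow_id,
            })
        elif entries:
            # Stack-trace or wrapped line: keep it with the entry it belongs to.
            entries[-1]["message"] += "\n" + line
    return entries


# Hypothetical sample lines, including a two-line stack trace.
sample = [
    "2026-04-14 10:15:30,123 ERROR [main] regionserver.HRegionServer: Failed to open region",
    "java.io.IOException: missing .regioninfo",
    "\tat org.example.Hypothetical.open(Hypothetical.java:42)",
    "2026-04-14 10:15:31,000 INFO  [main] regionserver.HRegionServer: Retrying",
]

entries = parse_entries(sample, "regionserver.log", "j-EXAMPLE")
print(len(entries), entries[0]["log_level"])
```

&lt;p&gt;Keeping the stack trace attached to its originating entry means that the embedding for that entry reflects the full exception context rather than an isolated line.&lt;/p&gt;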
&lt;p&gt;All scripts use AWS IAM authentication automatically. Here’s a screenshot of the data processing outputs.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55499.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90198" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55499.png" alt="Terminal screenshot showing successful completion of HBase log analysis processing, green checkmark, confirmation message &amp;quot;Successfully processed 4 file(s)&amp;quot;, and next steps section displaying three numbered instructions with redacted URLs for accessing OpenSearch Dashboards, starting Kiro CLI for AI-powered analysis, and querying data using job flow ID, followed by troubleshooting documentation references for HBase inconsistency analysis and log analysis guides." width="1000" height="243"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;h3&gt;Step 5: Enable AI-powered analysis&lt;/h3&gt; 
&lt;p&gt;Configure the AI analysis interface to enable natural language queries against your HBase operational data.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;To set up AI-powered analysis&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Launch Kiro CLI (already configured by automated setup):&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;code&gt;kiro-cli&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;Check the configured MCP servers and knowledge bases, starting with &lt;code&gt;/mcp list&lt;/code&gt;:&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554910.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90199" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554910.png" alt="Terminal screenshot showing MCP list command output displaying one configured MCP server named &amp;quot;opensearch-mcp-server&amp;quot; with command &amp;quot;uvx&amp;quot; in green and white text on dark background with pink shell prompt, featuring a purple &amp;quot;Configured MCP Servers&amp;quot; header with checkbox icon and green horizontal separator line." width="1000" height="137"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;/knowledge show&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554911.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90200" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554911.png" alt="Terminal screenshot showing &amp;quot;/knowledge show&amp;quot; command output displaying Agent kiro_default's knowledge base with repositories: Apache HBase source code, and HBase operational tools" width="1000" height="154"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;If you don’t see these two knowledge bases, you can add them manually with the following commands:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;# Note: Large repositories (~500MB) may take a while to index. Check progress with: /knowledge show
/knowledge add --name "HBase operational tools" --path /opt/hbase-repositories/hbase-operator-tools
/knowledge add --name "Apache HBase source code" --path /opt/hbase-repositories/hbase&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;Use natural language queries to analyze your HBase data. The AI analysis uses both the OpenSearch MCP Server for querying indexed data and the Filesystem knowledge bases for accessing HBase source code. You can add your custom runbooks for Kiro’s reference as well.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;strong&gt;For HBase inconsistency analysis:&lt;/strong&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-css"&gt;# HBase Inconsistency Detection and Remediation Guidelines
## Search Strategy
- Use fuzzy search for case variations/typos, term query for exact region IDs, match_phrase for paths, query_string for logs
- Always use .keyword subfields for exact text matching
- Cross-reference filesystem (wildcard: {"wildcard": {"path": "*&amp;lt;region_id&amp;gt;*"}}) with hbase:meta (match: {"match": {"row_key": "&amp;lt;region_id&amp;gt;"}})
- The total region count in hbase meta must match the total matched document count of wildcard path like "*/.regioninfo" in hbase rootdir path.  
- All terms of region_name.keyword for a region encoded name must match a wildcard path like "*/.regioninfo"
- All terms of table_name.keyword for a table must match a wildcard path like "*/.tabledesc*"
- 1595e783b53d99cd5eef43b6debb2682 is the master store region, which is located in &amp;lt;hbase-root-dir&amp;gt;/MasterData/data/master/store/1595e783b53d99cd5eef43b6debb2682/
- You may cross-check with the raw logs in /tmp/hbase-log-analysis/
## Issue Types
Orphan regions, missing .regioninfo, missing/extra regions in hbase:meta, rowkey holes, stuck RIT, master initialization failures
## Analysis Steps
### 1. Cross-Reference Meta vs Filesystem
- Filesystem regions NOT in hbase:meta → ORPHAN REGION
- Meta regions NOT in filesystem → MISSING REGION
### 2. Validate Region Chain Continuity
- Sort regions by STARTKEY, verify region[i].ENDKEY == region[i+1].STARTKEY
- First STARTKEY must be '', last ENDKEY must be ''
- Gaps → ROWKEY HOLE
### 3. Check Region States
- state != 'OPEN' → Check RIT
- Missing server assignment → UNASSIGNED
- Multiple servers → SPLIT BRAIN
- "deployed_servers" field must have only one region server address like "ip-xxx-xxx-xxx-xxx.ec2.internal,16020,1770781485397" . The value should not be null or have multiple values. 
### 4. Validate .regioninfo Files
- Missing .regioninfo in region directory → CORRUPT REGION
### 5. Cross-Check HBCK Report
- Compare orphan counts, RIT regions, filesystem vs meta region counts
### 6. Analyze Logs
- Search: "updating hbase:meta row=&amp;lt;region&amp;gt;", "STUCK", "RIT", "Failed" + "&amp;lt;region&amp;gt;", "Split"/"Merge" + "&amp;lt;region&amp;gt;"
## Remediation
- Reference knowledge bases: "Apache HBase source code", "HBase operational tools"
- Use hbck2: /usr/lib/hbase-operator-tools/hbase-hbck2.jar
- Prefix commands with sudo -u hbase
- Use aws s3 for S3-based rootdir
- Wait 300s after creating holes before hbck fixMeta (catalog janitor cycle)
- Use unassign instead of deprecated close_region
- If the region does not have .regioninfo in  &amp;lt;hbase-root-dir&amp;gt;/data/&amp;lt;namespace&amp;gt;/&amp;lt;table-name&amp;gt;/&amp;lt;region-encoded-name&amp;gt;/ but hbase:meta has that region's information and that region has been deployed on a healthy region server, you can use hbase shell to unassign and assign the region to re-generate .regioninfo
- Always add "sudo -u hbase" before "hbase shell" and "hbase hbck" commands
## Job flow
Target: &amp;lt;your-job-flow-id&amp;gt;
Inconsistency to detect: All kinds of inconsistencies&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
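&lt;p&gt;Two of the checks in the preceding guidelines, cross-referencing hbase:meta against the filesystem and validating region chain continuity, are simple enough to spot-check by hand. The following is an illustrative sketch of that logic, not the code that Kiro runs. The function names and sample data are hypothetical, and row keys are modeled as plain strings, whereas HBase compares raw bytes.&lt;/p&gt;

```python
def cross_reference(meta_regions, filesystem_regions):
    """Classify encoded region names found on only one side."""
    meta, fs = set(meta_regions), set(filesystem_regions)
    return {
        "orphan_regions": sorted(fs - meta),   # on filesystem, not in hbase:meta
        "missing_regions": sorted(meta - fs),  # in hbase:meta, not on filesystem
    }


def chain_breaks(regions):
    """Return continuity problems for one table's (start_key, end_key) pairs."""
    problems = []
    if not regions:
        return problems
    ordered = sorted(regions, key=lambda region: region[0])
    if ordered[0][0] != "":
        problems.append(("first STARTKEY is not empty", "", ordered[0][0]))
    if ordered[-1][1] != "":
        problems.append(("last ENDKEY is not empty", ordered[-1][1], ""))
    for current, following in zip(ordered, ordered[1:]):
        end_key, next_start = current[1], following[0]
        if end_key == next_start:
            continue
        # An empty end key means "to the end of the table", so it can only overlap.
        # Otherwise, if end_key sorts before next_start, rowkeys in between are
        # uncovered, which the guidelines call a rowkey hole.
        if end_key != "" and min(end_key, next_start) == end_key:
            kind = "ROWKEY HOLE"
        else:
            kind = "OVERLAP"
        problems.append((kind, end_key, next_start))
    return problems


# Hypothetical data: one orphan, one missing region, and a hole between "n" and "p".
print(cross_reference(["r1", "r2"], ["r2", "r3"]))
print(chain_breaks([("", "g"), ("g", "n"), ("p", "")]))
```

&lt;p&gt;A result from Kiro that disagrees with a manual check like this one is a good candidate for a closer look at the raw logs.&lt;/p&gt;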
&lt;p&gt;When prompted, enter “y” or “t” to allow Kiro to search through the MCP server and knowledge bases.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554912.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90202" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554912.png" alt="Terminal screenshot showing MCP tool execution authorization prompt." width="1000" height="91"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;You might see output similar to the following, in which Kiro checks for HBase issues.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554913.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90203" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554913.png" alt="Terminal screenshot showing HBase database query results for user table entries with server configuration details and an HBase Inconsistency Detection Framework analysis report" width="1000" height="168"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;Kiro summarized the examination results.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554914.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90204" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554914.png" alt="Terminal screenshot displaying HBase inconsistency detection analysis results for job flow, showing one critical missing .regioninfo file issue for HBase region in a HBase table, with cluster health metrics, risk assessment, recommended fixes, and generated diagnostic reports." width="1000" height="628"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;After summarizing the issue, Kiro provided mitigation commands.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554915.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90206" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554915.png" alt="Terminal screenshot displaying a structured HBase quick fix guide with three sections: recommended fix procedure with sequential steps for region reassignment, verification steps using AWS S3 and HBCK2 tools, and impact assessment showing 30-60 second downtime, zero data loss risk, and isolated region scope for fixing missing .regioninfo file in HBase region." width="1000" height="626"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;h2&gt;Cleaning up&lt;/h2&gt; 
&lt;p&gt;To avoid incurring future charges, delete the resources created during this walkthrough.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;To clean up the resources&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Delete the AWS CloudFormation stack from the &lt;a href="https://aws.amazon.com/console/" target="_blank" rel="noopener noreferrer"&gt;AWS Management Console&lt;/a&gt;:&lt;/strong&gt;&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554916.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90208" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554916.png" alt="AWS CloudFormation Stacks management console displaying a list view with stacks, showing the &amp;quot;dev-hbase-log-analysis&amp;quot; stack with CREATE_COMPLETE status, along with action buttons for Delete, Update stack, Stack actions, and Create stack." width="1000" height="139"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;&lt;strong&gt;Clean up Amazon EMR cluster resources (if created only for this walkthrough):&lt;/strong&gt;&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554917.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90209" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554917.png" alt="AWS EMR Clusters management console showing page clusters with a cluster in &amp;quot;Waiting&amp;quot; status" width="1000" height="138"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;/div&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;&lt;strong&gt;Verify resource cleanup:&lt;/strong&gt; In the &lt;a href="https://console.aws.amazon.com/" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;AWS Console&lt;/strong&gt;&lt;/a&gt;, confirm that all resources are deleted, and review your AWS bill for unexpected charges.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;strong&gt;Important considerations:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Amazon OpenSearch Service domains take several minutes to fully delete&lt;/li&gt; 
 &lt;li&gt;Amazon S3 buckets with versioning retain object versions&lt;/li&gt; 
 &lt;li&gt;Use smaller instance types for development to optimize costs&lt;/li&gt; 
 &lt;li&gt;Monitor usage with &lt;a href="https://aws.amazon.com/aws-cost-management/aws-cost-explorer/" target="_blank" rel="noopener noreferrer"&gt;AWS Cost Explorer&lt;/a&gt;&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, we showed you how to build an AI-powered HBase troubleshooting solution that transforms manual log analysis into an automated workflow. By combining Amazon OpenSearch Service vector search with Amazon Bedrock-powered analysis through the Kiro CLI, operations teams can resolve complex HBase inconsistencies faster and gain deeper operational insights. The solution demonstrates how AI augments human expertise to improve operational efficiency, reducing HBase inconsistency resolution from hours to minutes and root cause identification from days to hours. Ready to transform your HBase operations? Get started with the GitHub repository and explore the Amazon OpenSearch Service documentation for additional guidance on vector search capabilities.&lt;/p&gt; 
&lt;h3&gt;Acknowledgments&lt;/h3&gt; 
&lt;p&gt;The author would like to thank Xi Yang, Anirudh Chawla, and Sasidhar Puthambakkam for their contributions to developing the technical solution. Xi Yang is a Senior Hadoop System Engineer and Amazon EMR subject matter expert at AWS. Anirudh Chawla is an AWS Analytics Specialist Solution Architect who helps organizations harness their data effectively through AWS’s analytics platform. Sasidhar Puthambakkam is a Senior Hadoop Systems Engineer and Amazon EMR subject matter expert who provides architectural guidance for complex big data workloads.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554918-1.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignleft wp-image-90218 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554918-1.png" alt="Yu-Ting Su" width="238" height="318"&gt;&lt;/a&gt;
  &lt;/div&gt; 
  &lt;p&gt;Yu-Ting Su, Sr. Hadoop Systems Engineer, AWS Support Engineering. Yu-Ting is a Sr. Hadoop Systems Engineer at Amazon Web Services (AWS). Her expertise is in Amazon EMR and Amazon OpenSearch Service. She’s passionate about distributed computing and helping people bring their ideas to life.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>How to use streamlined permissions for Amazon S3 Tables and Iceberg materialized views</title>
		<link>https://aws.amazon.com/blogs/big-data/how-to-use-streamlined-permissions-for-amazon-s3-tables-and-iceberg-materialized-views/</link>
					
		
		<dc:creator><![CDATA[Srividya Parthasarathy]]></dc:creator>
		<pubDate>Mon, 11 May 2026 18:59:36 +0000</pubDate>
				<category><![CDATA[Amazon Athena]]></category>
		<category><![CDATA[Amazon EMR]]></category>
		<category><![CDATA[Amazon Redshift]]></category>
		<category><![CDATA[Amazon S3 Tables]]></category>
		<category><![CDATA[Amazon Simple Storage Service (S3)]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[AWS Big Data]]></category>
		<category><![CDATA[AWS Glue]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<guid isPermaLink="false">6aebad2189939edb9af142d1c6da760ad5fa5a2b</guid>

					<description>In this post, we walk through how to set up and manage S3 Tables in the AWS Glue Data Catalog, create and query Iceberg materialized views, and configure access controls that work across your analytics stack with IAM-based authorization.</description>
										<content:encoded>&lt;p&gt;Apache Iceberg has emerged as the open table format for data lakes. It handles petabyte-scale datasets, lets teams evolve schemas and partitions in place, and supports time travel and incremental processing for data lake management at scale. &lt;a href="https://aws.amazon.com/s3/features/tables/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3 Tables&lt;/a&gt; provide a fully managed Apache Iceberg table experience in &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3&lt;/a&gt;, optimized for analytics workloads, and integrate with the &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/serverless-etl-aws-glue/aws-glue-data-catalog.html" target="_blank" rel="noopener noreferrer"&gt;AWS Glue Data Catalog&lt;/a&gt; so AWS analytics services such as &lt;a href="https://aws.amazon.com/redshift" target="_blank" rel="noopener noreferrer"&gt;Amazon Redshift&lt;/a&gt;,&amp;nbsp;&lt;a href="https://aws.amazon.com/emr/" target="_blank" rel="noopener noreferrer"&gt;Amazon EMR&lt;/a&gt;,&amp;nbsp;&lt;a href="https://aws.amazon.com/athena" target="_blank" rel="noopener noreferrer"&gt;Amazon Athena&lt;/a&gt;,&amp;nbsp;&lt;a href="https://aws.amazon.com/sagemaker/" target="_blank" rel="noopener noreferrer"&gt;Amazon SageMaker&lt;/a&gt;, and&amp;nbsp;&lt;a href="https://aws.amazon.com/glue" target="_blank" rel="noopener noreferrer"&gt;AWS Glue&lt;/a&gt; query your data. Together, they form the foundation of a modern data lake architecture on AWS.&lt;/p&gt; 
&lt;p&gt;S3 Tables integrate with the AWS Glue Data Catalog using &lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener noreferrer"&gt;AWS Identity and Access Management (IAM)&lt;/a&gt;-based authorization. If you manage analytics workloads across these services, you can now define permissions across storage, catalog, and compute in a single IAM policy. This gives teams already using IAM a straightforward path to govern access to S3 Tables resources without changing their existing permission model. For fine-grained access controls, you can opt in to AWS Lake Formation at any time through the AWS Management Console, &lt;a href="https://aws.amazon.com/cli" target="_blank" rel="noopener noreferrer"&gt;AWS Command Line Interface (AWS CLI)&lt;/a&gt;, API, or &lt;a href="https://aws.amazon.com/blogs/storage/tag/aws-cloudformation/" target="_blank" rel="noopener noreferrer"&gt;AWS CloudFormation&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;Iceberg materialized views created in the Glue Data Catalog extend this foundation by letting you store pre-computed query results as Iceberg data on Amazon S3. When a query repeats aggregations or joins across large datasets, the engine reads directly from the materialized view’s S3 location rather than reprocessing the base tables. A materialized view can reside in S3 Tables or in an S3 general purpose bucket, independent of where its base tables live, which lets you place pre-computed results wherever fits your access patterns and cost model best.&lt;/p&gt; 
&lt;p&gt;In this post, we walk through how to set up and manage S3 Tables in the AWS Glue Data Catalog, create and query Iceberg materialized views, and configure access controls that work across your analytics stack with IAM-based authorization.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58901.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90882" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58901.png" alt="Architecture diagram showing AWS Glue Data Catalog integration with Amazon Athena, AWS Glue, Amazon Redshift, and Amazon EMR through IAM roles and policies, with Amazon S3 storage and optional AWS Lake Formation governance." width="1404" height="1143"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;The preceding architecture diagram shows how S3 Tables integrate with the AWS Glue Data Catalog using IAM-based authorization, so you can define the necessary permissions across storage, catalog, and query engines in a single IAM policy. This permission model accelerates onboarding for new teams and workloads.&lt;/p&gt; 
&lt;h3&gt;Key architecture components&lt;/h3&gt; 
&lt;p&gt;&lt;strong&gt;Storage layer&lt;/strong&gt;: Data is stored as Iceberg tables in Amazon S3 Tables.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Catalog layer&lt;/strong&gt;: AWS Glue Data Catalog serves as the single metadata repository.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Compute layer&lt;/strong&gt;: Amazon Athena, AWS Glue, Amazon Redshift, and Amazon EMR connect to a single Data Catalog to access Iceberg tables.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Security&lt;/strong&gt;: AWS IAM authorizes access to resources in the storage, catalog, and compute layers.&lt;/p&gt; 
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;To follow along with this post, you must have an &lt;a href="https://aws.amazon.com/resources/create-account/" target="_blank" rel="noopener noreferrer"&gt;AWS account&lt;/a&gt;, an IAM role or user with appropriate permissions, and familiarity with the following services:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;IAM&lt;/li&gt; 
 &lt;li&gt;AWS Glue Data Catalog&lt;/li&gt; 
 &lt;li&gt;Amazon S3&lt;/li&gt; 
 &lt;li&gt;Amazon Athena&lt;/li&gt; 
 &lt;li&gt;Amazon Redshift&lt;/li&gt; 
 &lt;li&gt;Amazon EMR&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;For the minimum permissions that the role or user needs for metadata and data access, refer to the &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/s3tables-catalog-prerequisites.html#s3tables-required-iam-permissions" target="_blank" rel="noopener noreferrer"&gt;required IAM permissions documentation&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;Solution walkthrough&lt;/h2&gt; 
&lt;p&gt;In this walkthrough, you will integrate S3 Tables with the AWS Glue Data Catalog, create Iceberg materialized views, and query data using multiple analytics engines. You will also learn how to use materialized views for complex aggregations that are queried frequently while the underlying data continues to change. The walkthrough takes about 45–60 minutes to complete.&lt;/p&gt; 
&lt;h3&gt;Set up S3 Tables and integrate with the Glue Data Catalog&lt;/h3&gt; 
&lt;p&gt;Navigate to the Amazon S3 console:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;On the left menu, select &lt;strong&gt;Table buckets&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Choose the &lt;strong&gt;Create table bucket&lt;/strong&gt; button.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58902.jpg"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90883" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58902.jpg" alt="Amazon S3 console showing the Table buckets management page in the US West (N. California) us-west-1 Region with zero table buckets, integration status disabled, and the Create table bucket button highlighted." width="989" height="565"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;On the next screen, enter &lt;strong&gt;salesbucket&lt;/strong&gt; as the bucket name. Make sure that the &lt;strong&gt;Enable integration&lt;/strong&gt; check box is selected. This step integrates S3 Tables with the AWS Glue Data Catalog.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58903.jpg"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90884" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58903.jpg" alt="AWS S3 Create table bucket form with General configuration showing bucket name &amp;quot;salesbucket&amp;quot; and Integration with AWS analytics services section with Enable integration checkbox selected." width="989" height="520"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;ol start="4"&gt; 
 &lt;li&gt;Keep the other options as default and choose &lt;strong&gt;Create table bucket&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;After it is created, you will be redirected back to the list of table buckets. Choose the table bucket &lt;strong&gt;salesbucket&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Select the &lt;strong&gt;Create table with Athena&lt;/strong&gt; button.&lt;/li&gt; 
 &lt;li&gt;Create a namespace in S3 Tables, which is equivalent to a database in the AWS Glue Data Catalog. Enter “sales” as the namespace (database) name and choose &lt;strong&gt;Create namespace&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58904.jpg"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90885" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58904.jpg" alt="Create table with Athena dialog in the Amazon S3 salesbucket console showing namespace configuration with &amp;quot;Create a namespace&amp;quot; selected and namespace name set to &amp;quot;sales.&amp;quot;" width="1375" height="747"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;ol start="8"&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Create table with Athena&lt;/strong&gt;. A new tab opens with the Amazon Athena console.&lt;/li&gt; 
 &lt;li&gt;The Amazon Athena console shows an example query that creates a table, along with example statements that insert rows into it. To use this query block, uncomment the code, then highlight and run each statement individually. When you finish, the table contains data.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58905.jpg"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90887" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58905.jpg" alt="Amazon Athena query editor showing a SQL analytics query on the daily_sales table with results displaying product categories, units sold, total revenue, and average price for February 2024 sales data." width="1357" height="741"&gt;&lt;/a&gt;&lt;/p&gt; 
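&lt;p&gt;For reference, the following statements are a minimal sketch of the table and rows that the rest of this walkthrough relies on. The product_category and sales_amount column names come from the materialized view query later in this post, and the row values match the daily_sales output shown in the Amazon EMR section. The sale_date column name and the column types are assumptions, as is running the statements with the sales database selected in the query editor. If the sample generated in your Athena console differs, treat the console version as authoritative.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;-- Sketch only: sale_date and the column types are assumed, not copied from the console sample.
CREATE TABLE daily_sales (
    sale_date date,
    product_category string,
    sales_amount double
)
TBLPROPERTIES ('table_type' = 'iceberg');

-- Rows matching the daily_sales query output shown later in this post.
INSERT INTO daily_sales VALUES
    (DATE '2024-01-15', 'Laptop', 900.00),
    (DATE '2024-01-15', 'Monitor', 250.00),
    (DATE '2024-01-16', 'Laptop', 1350.00),
    (DATE '2024-02-01', 'Monitor', 300.00),
    (DATE '2024-02-01', 'Keyboard', 60.00),
    (DATE '2024-02-02', 'Mouse', 25.00),
    (DATE '2024-02-02', 'Laptop', 1050.00),
    (DATE '2024-02-03', 'Laptop', 1200.00),
    (DATE '2024-02-03', 'Monitor', 375.00);&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 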
&lt;h3&gt;Query S3 Tables and create a materialized view using Amazon EMR&lt;/h3&gt; 
&lt;p&gt;To run the statements in this section on Amazon EMR, complete the following steps to configure the cluster:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Create an IAM role for the Amazon EMR instance profile&amp;nbsp;following the &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-role.html" target="_blank" rel="noopener noreferrer"&gt;Amazon EMR Management Guide&lt;/a&gt;. Attach the following permissions policy and trust relationship, which are needed to work with materialized views.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Replace &amp;lt;ACCOUNT ID&amp;gt; with your AWS account ID, &amp;lt;Instance_profile_role&amp;gt; with the name of the Amazon EMR instance profile role, and &amp;lt;REGION&amp;gt; with your AWS Region.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-css"&gt;{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Sid":"GlueDataCatalogPermissions",
         "Effect":"Allow",
         "Action":[
            "glue:GetCatalog",
            "glue:GetDatabase",
            "glue:CreateTable",
            "glue:GetTable",
            "glue:GetTables",
            "glue:UpdateTable",
            "glue:DeleteTable"
         ],
         "Resource":[
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT ID&amp;gt;:catalog",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT ID&amp;gt;:catalog/s3tablescatalog",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT ID&amp;gt;:catalog/s3tablescatalog/*",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT ID&amp;gt;:database/salesdb",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT ID&amp;gt;:database/salesdb/*",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT ID&amp;gt;:database/s3tablescatalog",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT ID&amp;gt;:database/s3tablescatalog/*",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT ID&amp;gt;:table/s3tablescatalog/*",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT ID&amp;gt;:table/*/*"
         ]
      },
      {
         "Sid":"S3TablesDataAccessPermissions",
         "Effect":"Allow",
         "Action":[
            "s3tables:GetTableBucket",
            "s3tables:GetNamespace",
            "s3tables:GetTable",
            "s3tables:GetTableMetadataLocation",
            "s3tables:GetTableData",
            "s3tables:ListTableBuckets",
            "s3tables:CreateTable",
            "s3tables:PutTableData",
            "s3tables:UpdateTableMetadataLocation",
            "s3tables:ListNamespaces",
            "s3tables:ListTables",
            "s3tables:DeleteTable"
         ],
         "Resource":[
            "arn:aws:s3tables:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT ID&amp;gt;:bucket/*"
         ]
      },
      {
         "Effect":"Allow",
         "Action":"iam:PassRole",
         "Resource":"arn:aws:iam::&amp;lt;ACCOUNT ID&amp;gt;:role/service-role/&amp;lt;Instance_profile_role&amp;gt;"
      }
   ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Add the following statement to the role’s trust policy, alongside any existing statements:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-css"&gt;{
   "Sid": "",
   "Effect": "Allow",
   "Principal": {
      "Service": "glue.amazonaws.com"
   },
   "Action": "sts:AssumeRole"
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;Launch an Amazon EMR cluster running release 7.12.0 or later, with Iceberg enabled and with the instance profile role created in the previous step. For more information, refer to &lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg-use-spark-cluster.html" target="_blank" rel="noopener noreferrer"&gt;Use an Iceberg cluster with Spark&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;Connect to the primary node of your Amazon EMR cluster by using SSH, and run the following command to start a Spark application with the required configurations:&lt;/li&gt; 
&lt;/ol&gt; 
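&lt;p&gt;Replace &amp;lt;bucket_name&amp;gt; with your bucket name, &amp;lt;region&amp;gt; with your AWS Region, and &amp;lt;accountid&amp;gt; with your AWS account ID.&lt;/p&gt; 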
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-css"&gt;spark-sql \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.glue_catalog.type=glue \
  --conf spark.sql.catalog.glue_catalog.warehouse=s3://&amp;lt;bucket_name&amp;gt; \
  --conf spark.sql.catalog.glue_catalog.glue.region=&amp;lt;region&amp;gt; \
  --conf spark.sql.catalog.glue_catalog.glue.id=&amp;lt;accountid&amp;gt;:s3tablescatalog/salesbucket \
  --conf spark.sql.catalog.glue_catalog.glue.account-id=&amp;lt;accountid&amp;gt; \
  --conf spark.sql.catalog.glue_catalog.client.region=&amp;lt;region&amp;gt; \
  --conf spark.sql.optimizer.answerQueriesWithMVs.enabled=true \
  --conf spark.sql.defaultCatalog=glue_catalog&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="4"&gt; 
 &lt;li&gt;Run the following statements to select the sales database and query the daily_sales table.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;spark-sql ()&amp;gt; use sales;
spark-sql (sales)&amp;gt; select * from daily_sales;
2024-01-15 Laptop 900.0
2024-01-15 Monitor 250.0
2024-01-16 Laptop 1350.0
2024-02-01 Monitor 300.0
2024-02-01 Keyboard 60.0
2024-02-02 Mouse 25.0
2024-02-02 Laptop 1050.0
2024-02-03 Laptop 1200.0
2024-02-03 Monitor 375.0&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="5"&gt; 
 &lt;li&gt;Create a materialized view.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;CREATE MATERIALIZED VIEW sales_mv AS
SELECT
    product_category,
    COUNT(*) AS units_sold,
    SUM(sales_amount) AS total_revenue,
    AVG(sales_amount) AS average_price
FROM
    glue_catalog.sales.daily_sales
GROUP BY
    product_category;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;A newly created materialized view is populated with the initial query results but does not update automatically as base table data changes. To keep it current, specify a SCHEDULE REFRESH EVERY clause when creating the view. This accepts a time interval and unit, so you can define how often the materialized view is recomputed from the base tables.&lt;/p&gt; 
&lt;ol start="6"&gt; 
 &lt;li&gt;Add a refresh interval. The following statement defines sales_mv with a refresh schedule, so use it in place of the statement in the previous step.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;CREATE MATERIALIZED VIEW sales_mv
SCHEDULE REFRESH EVERY 2 HOURS AS
SELECT
    product_category,
    COUNT(*) AS units_sold,
    SUM(sales_amount) AS total_revenue,
    AVG(sales_amount) AS average_price
FROM
    glue_catalog.sales.daily_sales
GROUP BY
    product_category;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="7"&gt; 
 &lt;li&gt;Alternatively, you can refresh the materialized view manually.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;For manual full refresh, you can use the following command:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;REFRESH MATERIALIZED VIEW sales_mv FULL;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;For manual incremental refresh, you can use the following command:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;REFRESH MATERIALIZED VIEW sales_mv;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;For more details, refer to &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/materialized-views.html#materialized-views-refreshing" target="_blank" rel="noopener noreferrer"&gt;Refreshing materialized views&lt;/a&gt;.&lt;/p&gt; 
&lt;ol start="8"&gt; 
 &lt;li&gt;Query the materialized view.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;spark-sql (sales)&amp;gt; select * from sales_mv;
Keyboard 1 60.0 60.0
Laptop 4 4500.0 1125.0
Mouse 1 25.0 25.0
Monitor 3 925.0 308.3333333333333&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
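&lt;p&gt;As an illustration of the repeated-aggregation pattern described earlier, the following sketch runs a query with the same shape as the view definition directly against the base table. Because the Spark session above sets spark.sql.optimizer.answerQueriesWithMVs.enabled=true, the engine may be able to answer a query like this from sales_mv rather than from daily_sales. Whether that rewrite happens for a given query is not something this post confirms, so treat it as an assumption to verify in your environment. The ORDER BY clause is added here only to make the output order predictable.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;-- Same aggregation as the sales_mv definition, issued against the base table.
-- Sketch only: whether the optimizer serves this from sales_mv is an assumption to verify.
SELECT
    product_category,
    COUNT(*) AS units_sold,
    SUM(sales_amount) AS total_revenue,
    AVG(sales_amount) AS average_price
FROM
    glue_catalog.sales.daily_sales
GROUP BY
    product_category
ORDER BY
    product_category;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;With the nine sample rows, the per-category values should agree with the sales_mv output in the previous step. For example, the four Laptop rows sum to 900 + 1350 + 1050 + 1200 = 4500, which averages to 1125.&lt;/p&gt; 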
&lt;p&gt;After the Iceberg materialized views are created, you can access them using IAM principals that have the required IAM permissions on the Glue Data Catalog resources and their underlying storage.&lt;/p&gt; 
&lt;p&gt;Iceberg materialized views are flexible in how they combine base tables and access control modes. Base tables can reside in S3 general-purpose buckets (with IAM or Lake Formation access control), in S3 Tables (through the s3tablescatalog catalog), or a combination of these—all within a single materialized view definition. The materialized view itself can use either IAM or AWS Lake Formation access control, independently of its base tables.&lt;/p&gt; 
&lt;p&gt;For more details, refer to &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/materialized-views.html#materialized-views-how-they-work" target="_blank" rel="noopener noreferrer"&gt;How materialized views work with AWS Glue&lt;/a&gt;.&lt;/p&gt; 
&lt;h3&gt;Query using Athena&lt;/h3&gt; 
&lt;p&gt;Additionally, you can query the same materialized view from Athena SQL. The following image shows the same query run on Athena and the resulting output.&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58906.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90888" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58906.png" alt="Amazon Athena query editor showing SELECT query results from the sales_mv materialized view with product category aggregations including Keyboard and Laptop sales data." width="1429" height="610"&gt;&lt;/a&gt;&lt;/p&gt; 
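&lt;p&gt;A query along the following lines is one way to read the view from the Athena query editor. This is a sketch that assumes the table bucket’s catalog and the sales database are already selected in the editor, so the view can be referenced by its unqualified name. The column names come from the view definition in the Amazon EMR section.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;-- Sketch only: assumes the catalog and the sales database are selected in the query editor.
SELECT
    product_category,
    units_sold,
    total_revenue
FROM
    sales_mv
ORDER BY
    total_revenue DESC;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 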
&lt;h3&gt;Query using Amazon Redshift&lt;/h3&gt; 
&lt;p&gt;To query the S3 Tables in AWS Glue Data Catalog using Amazon Redshift, you must create a database in the default catalog in Glue Data Catalog that points to the S3 Tables catalog.&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;On the AWS Glue console, choose &lt;strong&gt;Databases&lt;/strong&gt;, and then choose &lt;strong&gt;Add Database&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58907.jpg"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90889" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58907.jpg" alt="AWS Glue Data Catalog Databases page showing one default database in catalog 466053964652, with the Add database button highlighted." width="995" height="556"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;Choose the &lt;strong&gt;Glue Database resource link&lt;/strong&gt; option, enter a name for the database (this post uses salesdb), and choose &lt;strong&gt;salesbucket&lt;/strong&gt; as the target catalog and &lt;strong&gt;sales&lt;/strong&gt; as the target database. Then select &lt;strong&gt;Create database&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58908.jpg"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90890" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58908.jpg" alt="AWS Glue Create a database form with Glue Database Resource Link selected, name set to &amp;quot;salesdb,&amp;quot; target catalog &amp;quot;salesbucket,&amp;quot; and target database &amp;quot;sales.&amp;quot;" width="992" height="553"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;After you create the database, the “salesdb” resource link appears under &lt;strong&gt;Databases&lt;/strong&gt; in the AWS Glue Data Catalog.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58909.jpg"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90891" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58909.jpg" alt="AWS Glue Data Catalog Databases page showing two databases: &amp;quot;default&amp;quot; and the newly created &amp;quot;salesdb&amp;quot; resource link with source catalog pointing to s3tablescatalog." width="1368" height="361"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;Create an IAM role with the following policy, which Amazon Redshift uses when you create the external schema. Replace &amp;lt;REGION&amp;gt; with your AWS Region and &amp;lt;ACCOUNTID&amp;gt; with your AWS account ID.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Sid":"GlueDataCatalogPermissions",
         "Effect":"Allow",
         "Action":[
            "glue:GetCatalog",
            "glue:GetDatabase",
            "glue:CreateTable",
            "glue:GetTable",
            "glue:GetTables",
            "glue:UpdateTable",
            "glue:DeleteTable"
         ],
         "Resource":[
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNTID&amp;gt;:catalog",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNTID&amp;gt;:catalog/s3tablescatalog",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNTID&amp;gt;:catalog/s3tablescatalog/*",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNTID&amp;gt;:database/salesdb",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNTID&amp;gt;:database/salesdb/*",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNTID&amp;gt;:database/s3tablescatalog",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNTID&amp;gt;:database/s3tablescatalog/*",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNTID&amp;gt;:table/s3tablescatalog/*",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNTID&amp;gt;:table/*/*"
         ]
      },
      {
         "Sid":"S3TablesDataAccessPermissions",
         "Effect":"Allow",
         "Action":[
            "s3tables:GetTableBucket",
            "s3tables:GetNamespace",
            "s3tables:GetTable",
            "s3tables:GetTableMetadataLocation",
            "s3tables:GetTableData",
            "s3tables:ListTableBuckets",
            "s3tables:CreateTable",
            "s3tables:PutTableData",
            "s3tables:UpdateTableMetadataLocation",
            "s3tables:ListNamespaces",
            "s3tables:ListTables",
            "s3tables:DeleteTable"
         ],
         "Resource":[
            "arn:aws:s3tables:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNTID&amp;gt;:bucket/*"
         ]
      }
   ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Create an&amp;nbsp;&lt;a href="https://docs.aws.amazon.com/redshift/latest/gsg/new-user.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Redshift&lt;/a&gt;&amp;nbsp;provisioned cluster or an&amp;nbsp;&lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-console-workgroups-create-workgroup-wizard.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Redshift Serverless&lt;/a&gt; workgroup, and attach the IAM role created in the previous step.&lt;/p&gt; 
&lt;p&gt;To access the AWS Glue Catalog and the resource link, you can now log in to Amazon Redshift as a local user. We use the &lt;strong&gt;admin&lt;/strong&gt; user and Amazon Redshift Query Editor v2.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-589010.jpg"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90892" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-589010.jpg" alt="Amazon Redshift Query Editor v2 interface connected to Serverless workgroup &amp;quot;s3tablesblog&amp;quot; showing 2 native databases and 1 external database with an empty query editor ready for input." width="581" height="238"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;To create the external schema, run the following command. Replace &amp;lt;ACCOUNT_ID&amp;gt; with your AWS account ID, &amp;lt;IAM_ROLE&amp;gt; with the name of the IAM role created for schema access, and &amp;lt;REGION&amp;gt; with your AWS Region.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;CREATE EXTERNAL SCHEMA salesdb
FROM DATA CATALOG DATABASE 'salesdb'
IAM_ROLE 'arn:aws:iam::&amp;lt;ACCOUNT_ID&amp;gt;:role/&amp;lt;IAM_ROLE&amp;gt;'
REGION '&amp;lt;REGION&amp;gt;'
CATALOG_ID '&amp;lt;ACCOUNT_ID&amp;gt;';&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;After you have created the external schema, it will show up on the left side, under the dev database. The table that we created, daily_sales, is available and we can query directly from Amazon Redshift using a local user.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-589011.jpg"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90893" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-589011.jpg" alt="Amazon Redshift Query Editor v2 showing a SELECT query on the daily_sales table in the salesdb schema with 9 rows of results displaying sale dates, product categories, and sales amounts from January–February 2024." width="1376" height="739"&gt;&lt;/a&gt;&lt;/p&gt; 
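&lt;p&gt;As a sketch of what a query through the external schema can look like, the following statement aggregates daily_sales by category from Amazon Redshift. It assumes the external schema is named salesdb, as in the CREATE EXTERNAL SCHEMA command above, and that the table has the product_category and sales_amount columns used in the materialized view definition.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;-- Sketch only: assumes the salesdb external schema created above.
SELECT
    product_category,
    COUNT(*) AS units_sold,
    SUM(sales_amount) AS total_revenue
FROM
    salesdb.daily_sales
GROUP BY
    product_category
ORDER BY
    total_revenue DESC;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 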
&lt;h2&gt;Clean up&lt;/h2&gt; 
&lt;p&gt;The following cleanup steps permanently delete the data, including the daily_sales table and the sales_mv materialized view. Make sure that you have backed up any data that you need to retain before proceeding.&lt;/p&gt; 
&lt;p&gt;To avoid incurring future charges, clean up the resources that you created during this walkthrough:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Remove the AWS Glue Data Catalog resources&lt;/li&gt; 
 &lt;li&gt;Delete the table bucket&lt;/li&gt; 
 &lt;li&gt;Terminate and delete the Amazon Redshift cluster&lt;/li&gt; 
 &lt;li&gt;Terminate and delete the Amazon EMR cluster&lt;/li&gt; 
 &lt;li&gt;Delete the IAM roles and policies that you created&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;Amazon S3 Tables now integrate with AWS Glue Data Catalog through IAM-based authorization via a single IAM policy. By consolidating permissions for storage, catalog, and query engines into one IAM policy, you can streamline authorization with AWS analytics services like Amazon Athena, Amazon EMR, and AWS Glue. You can use this streamlined IAM authorization model to build your data lake faster while maintaining enterprise-grade security. For organizations with more granular data access requirements, AWS Lake Formation remains available to layer fine-grained access controls on top of this foundation. You can configure this through the AWS Management Console, CLI, API, or CloudFormation. This integration allows AWS analytics users to use IAM and scale their analytics capabilities with reduced operational complexity.&lt;/p&gt; 
&lt;p&gt;To learn more about S3 Tables and its integration with the AWS Glue Data Catalog, see &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integration-overview.html" target="_blank" rel="noopener noreferrer"&gt;Amazon S3 Tables integration with AWS analytics services overview&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/glue-federation-s3tables.html" target="_blank" rel="noopener noreferrer"&gt;Integrating with Amazon S3 Tables&lt;/a&gt;.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/10/16/rserafim.jpg"&gt;&lt;img loading="lazy" class="wp-image-84745 size-thumbnail alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/10/16/rserafim-100x133.jpg" alt="" width="100" height="133"&gt;&lt;/a&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Ricardo Serafim&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/rcserafim/"&gt;Ricardo&lt;/a&gt; is a Senior Analytics Specialist Solutions Architect at AWS. He has been helping companies with Data Warehouse solutions since 2007.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2024/10/18/Milindo.png"&gt;&lt;img loading="lazy" class="alignleft wp-image-70421 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2024/10/18/Milindo.png" alt="" width="100" height="129"&gt;&lt;/a&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Milind Oke&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/milindoke/"&gt;Milind&lt;/a&gt; is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/11/26/pratdas.jpeg"&gt;&lt;img loading="lazy" class="alignleft wp-image-85700 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/11/26/pratdas.jpeg" alt="" width="100" height="133"&gt;&lt;/a&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Pratik Das&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/das-pratik/"&gt;Pratik&lt;/a&gt; is a Senior Product Manager with AWS Lake Formation. He is passionate about all things data and works with customers to understand their requirements and build delightful experiences. He has a background in building data-driven solutions and machine learning systems.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/11/26/srivipar.jpg"&gt;&lt;img loading="lazy" class="size-full wp-image-85701 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/11/26/srivipar.jpg" alt="" width="100" height="133"&gt;&lt;/a&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Srividya Parthasarathy&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/srividya-parthasarathy-8b71bb32/"&gt;Srividya&lt;/a&gt; is a Senior Big Data Architect on the AWS Lake Formation team. She works with the product team and customers to build robust features and solutions for their analytical data platform. She enjoys building data mesh solutions and sharing them with the community.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Improve DynamoDB analytics with AWS Glue zero-ETL schema and partition controls</title>
		<link>https://aws.amazon.com/blogs/big-data/improve-dynamodb-analytics-with-aws-glue-zero-etl-schema-and-partition-controls/</link>
					
		
		<dc:creator><![CDATA[Raju Ansari]]></dc:creator>
		<pubDate>Mon, 11 May 2026 18:51:22 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon DynamoDB]]></category>
		<category><![CDATA[Amazon SageMaker Lakehouse]]></category>
		<category><![CDATA[AWS Glue]]></category>
		<category><![CDATA[Best Practices]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<category><![CDATA[Data Integrations]]></category>
		<category><![CDATA[Data Lake]]></category>
		<category><![CDATA[DynamoDB]]></category>
		<category><![CDATA[zero-ETL]]></category>
		<guid isPermaLink="false">28a8aea903a5d1a76f281acb8fdf7640a3125781</guid>

					<description>In this post, you learn how to replicate Amazon DynamoDB data to Apache Iceberg tables in Amazon S3 through a zero-ETL integration. We walk through the challenges that the DynamoDB nested, schema-flexible data model introduces for analytics workloads, and show you how to configure schema unnesting and data partitioning for a sample product catalog table. We also cover how to query the replicated data in Amazon Athena using standard SQL.</description>
										<content:encoded>&lt;p&gt;You store transactional data in &lt;a href="https://aws.amazon.com/dynamodb/" target="_blank" rel="noopener noreferrer"&gt;Amazon DynamoDB&lt;/a&gt; and get single-digit millisecond performance. However, when you want to run analytics, machine learning (ML), or reporting on that same data, you face a gap: your flexible, semi-structured DynamoDB schemas don’t align with the flat, columnar formats that analytics engines require. Bridging this gap typically means building and maintaining custom ETL pipelines, which adds development cost and operational overhead.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/zero-etl-using.html" target="_blank" rel="noopener noreferrer"&gt;AWS Glue Zero-ETL&lt;/a&gt; integration removes that pipeline work. It replicates your DynamoDB tables to Apache Iceberg tables in &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener noreferrer"&gt;Amazon Simple Storage Service (Amazon S3)&lt;/a&gt;, where you can query them directly with &lt;a href="https://aws.amazon.com/athena/" target="_blank" rel="noopener noreferrer"&gt;Amazon Athena&lt;/a&gt;. During setup, you can configure two capabilities that shape how the replicated data looks and performs: &lt;strong&gt;schema unnesting&lt;/strong&gt; flattens nested attributes into individual columns, and &lt;strong&gt;data partitioning&lt;/strong&gt; organizes data so your queries scan only what they need.&lt;/p&gt; 
&lt;p&gt;In this post, you learn how to replicate Amazon DynamoDB data to Apache Iceberg tables in Amazon S3 through a zero-ETL integration. We walk through the challenges that the DynamoDB nested, schema-flexible data model introduces for analytics workloads, and show you how to configure schema unnesting and data partitioning for a sample product catalog table. We also cover how to query the replicated data in Amazon Athena using standard SQL.&lt;/p&gt; 
&lt;h2&gt;Semi-structured data meets analytics&lt;/h2&gt; 
&lt;p&gt;Your product catalog in DynamoDB contains items with nested attributes like product details, pricing tiers, and inventory information. A typical item looks like this:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-json"&gt;{
  "product_id": "P-1001",
  "name": "Wireless Headphones",
  "productdetails": {
    "brand": "AudioTech",
    "category": "Electronics",
    "weight_kg": 0.25,
    "specification": {
       "color": "Black",
       "storage": "128GB"
    }
  },
  "pricing": {
    "list_price": 79.99,
    "discount_pct": 10
  },
  "created_at": 1701388800000
}&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;This structure supports fast transactional reads and writes. However, when you replicate this data for analytics, you face two decisions:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;You must decide whether to flatten nested maps like &lt;code&gt;productdetails&lt;/code&gt; into individual columns or preserve them as-is.&lt;/li&gt; 
 &lt;li&gt;You must choose how to organize the data on disk so that queries filtering by brand or date range scan only relevant partitions.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;With AWS Glue Zero-ETL, you address both decisions through configurable schema unnesting and data partitioning.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;You replicate data from your DynamoDB table through AWS Glue Zero-ETL into Apache Iceberg tables stored in Amazon S3, then query the results with Amazon Athena. The following diagram illustrates the end-to-end architecture:&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-1.jpeg"&gt;&lt;img loading="lazy" class="size-full wp-image-90947" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-1.jpeg" alt="Data flow diagram showing AWS data pipeline: DynamoDB source table → AWS Glue zero-ETL integration → Apache Iceberg on Amazon S3 → Amazon Athena analytics query." width="753" height="261"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;AWS Glue zero-ETL ingests data from Amazon DynamoDB, writes it in Apache Iceberg format to your Amazon S3 data lake, and makes it available for SQL queries in Amazon Athena—with no pipelines to build or maintain. With this integration, you:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Save development time&lt;/strong&gt; by skipping custom code and ETL job management&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Keep DynamoDB performance intact&lt;/strong&gt; because replication doesn’t consume the table’s provisioned read/write capacity&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Get data within 15 minutes&lt;/strong&gt; of changes in the source table&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Query with standard tools&lt;/strong&gt; because data lands in Apache Iceberg format, an open table format that AWS natively supports for high-performance analytics&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;During setup, you configure two output settings:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Schema unnesting in Zero-ETL&lt;/strong&gt;: You choose how nested attributes appear in the target. Flattening nested maps into individual columns streamlines your queries and reduces complexity.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Data partitioning in Zero-ETL&lt;/strong&gt;: You choose how data is organized into partitions. When you filter on a partition column, the query engine reads only matching data instead of scanning everything, cutting both query time and cost.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Schema unnesting&lt;/h2&gt; 
&lt;p&gt;When you create a zero-ETL integration, you can choose one of three unnesting options. Schema unnesting transforms complex, nested DynamoDB structures into formats that analytics engines can query directly, removing post-processing transformations.&lt;/p&gt; 
&lt;p&gt;Each option changes how nested DynamoDB attributes appear in the target table. The right choice depends on your analytics tools and how consistent your DynamoDB schemas are.&lt;/p&gt; 
&lt;h3&gt;Option 1: No unnesting&lt;/h3&gt; 
&lt;p&gt;This option preserves the original nested structure. DynamoDB maps and lists remain as structured columns in the target.&lt;/p&gt; 
&lt;p&gt;Using the product example, the target table retains &lt;code&gt;product_id&lt;/code&gt; and &lt;code&gt;value&lt;/code&gt; as columns that hold the DynamoDB partition key and the DynamoDB record, respectively.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Recommended for&lt;/strong&gt;: Workloads where your analytics tools natively support querying nested data and you want to preserve the DynamoDB structure unchanged.&lt;/p&gt; 
&lt;h3&gt;Option 2: Unnest one level&lt;/h3&gt; 
&lt;p&gt;This option flattens top-level maps into individual columns. Lists remain nested.&lt;/p&gt; 
&lt;p&gt;With this option, &lt;code&gt;productdetails&lt;/code&gt; and &lt;code&gt;pricing&lt;/code&gt; each become separate columns.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Recommended for&lt;/strong&gt;: Scenarios where your DynamoDB items have a consistent schema and you want to balance structure preservation with query simplicity.&lt;/p&gt; 
&lt;h3&gt;Option 3: Unnest all levels (default)&lt;/h3&gt; 
&lt;p&gt;This option recursively flattens nested structures using dot notation and produces the flattest schema.&lt;/p&gt; 
&lt;p&gt;For the product table, this creates columns such as &lt;code&gt;productdetails.brand&lt;/code&gt;, &lt;code&gt;productdetails.category&lt;/code&gt;, &lt;code&gt;productdetails.specification.color&lt;/code&gt;, &lt;code&gt;productdetails.specification.storage&lt;/code&gt;, &lt;code&gt;pricing.list_price&lt;/code&gt;, and &lt;code&gt;pricing.discount_pct&lt;/code&gt;. Each column is directly queryable without nested access patterns.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Recommended for&lt;/strong&gt;: Analytics tools that prefer flat schemas when your DynamoDB items have a reasonably consistent structure. Note that deeply nested or highly variable schemas can produce very wide tables.&lt;/p&gt; 
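&lt;p&gt;To make the dot-notation naming concrete, the following Python sketch flattens the sample item in the same way. It is an illustration only, not the AWS Glue implementation: &lt;code&gt;unnest_all_levels&lt;/code&gt; is a hypothetical helper, and keeping lists intact as leaf values is an assumption of this sketch.&lt;/p&gt; 

```python
from typing import Any


def unnest_all_levels(item: dict[str, Any], prefix: str = ""):
    """Recursively flatten nested maps into dot-notation column names."""
    columns: dict[str, Any] = {}
    for key, value in item.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into the nested map and carry the parent path along.
            columns.update(unnest_all_levels(value, name))
        else:
            # Scalars (and lists, by assumption) become leaf columns as-is.
            columns[name] = value
    return columns


# The sample product item from earlier in this post.
product = {
    "product_id": "P-1001",
    "name": "Wireless Headphones",
    "productdetails": {
        "brand": "AudioTech",
        "category": "Electronics",
        "weight_kg": 0.25,
        "specification": {"color": "Black", "storage": "128GB"},
    },
    "pricing": {"list_price": 79.99, "discount_pct": 10},
    "created_at": 1701388800000,
}

flat = unnest_all_levels(product)
for column in sorted(flat):
    print(column, "=", flat[column])
```

&lt;p&gt;For the sample item, the sketch produces ten flat columns, including &lt;code&gt;productdetails.specification.color&lt;/code&gt; and &lt;code&gt;pricing.list_price&lt;/code&gt;. Every additional nested attribute adds another column, which is why highly variable schemas can lead to wide tables.&lt;/p&gt; 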
&lt;h2&gt;Data partitioning&lt;/h2&gt; 
&lt;p&gt;You can speed up your queries and reduce costs by partitioning your replicated data. Partitioning divides data into logical segments on disk.&lt;/p&gt; 
&lt;p&gt;When you include a filter on a partition column in your query, the query engine skips irrelevant segments entirely. This behavior is called &lt;em&gt;partition pruning&lt;/em&gt;: instead of scanning the entire dataset, the engine reads only the data that matches your filter conditions. For large tables, partition pruning can reduce both query runtime and cost significantly.&lt;/p&gt; 
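&lt;p&gt;The idea behind partition pruning can be sketched in a few lines of Python. This is a conceptual model, not how Athena or Apache Iceberg work internally: the rows are made-up sample data, and &lt;code&gt;partition_by&lt;/code&gt; and &lt;code&gt;query_with_pruning&lt;/code&gt; are hypothetical helpers.&lt;/p&gt; 

```python
from collections import defaultdict

# Made-up rows for illustration; only P-1001 appears elsewhere in this post.
rows = [
    {"product_id": "P-1001", "brand": "AudioTech"},
    {"product_id": "P-1002", "brand": "TechCo"},
    {"product_id": "P-1003", "brand": "AudioTech"},
]


def partition_by(records, column):
    """Group records into one segment per distinct value of the column."""
    partitions = defaultdict(list)
    for record in records:
        partitions[record[column]].append(record)
    return dict(partitions)


def query_with_pruning(partitions, wanted_value):
    """Read only the segment whose key matches the filter value."""
    selected = [key for key in partitions if key == wanted_value]
    matches = [record for key in selected for record in partitions[key]]
    return matches, len(selected)


partitions = partition_by(rows, "brand")
matches, partitions_read = query_with_pruning(partitions, "AudioTech")
print(f"read {partitions_read} of {len(partitions)} partitions")
print([record["product_id"] for record in matches])
```

&lt;p&gt;The filter on the partition column lets the sketch touch one segment and ignore the other. A filter on a column that is not part of the partitioning scheme would have to look inside every segment.&lt;/p&gt; 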
&lt;h3&gt;Default partitioning&lt;/h3&gt; 
&lt;p&gt;If you don’t specify partition columns, AWS Glue Zero-ETL partitions data using the DynamoDB primary key with bucketing. This approach supports general-purpose queries without requiring manual configuration. For specific query patterns or performance requirements, you can define custom partitioning strategies described in the subsections that follow.&lt;/p&gt; 
&lt;h3&gt;Identity partitioning&lt;/h3&gt; 
&lt;p&gt;Identity partitioning uses raw column values to create partitions. You apply this strategy to low-to-medium cardinality columns such as brand, category, or AWS Region. To partition the product table by &lt;code&gt;productdetails.brand&lt;/code&gt; and create a separate partition for each brand, use this configuration:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-json"&gt;{
  "partitionSpec": [
    {
      "fieldName": "productdetails.brand",
      "functionSpec": "identity"
    }
  ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;With this setup, AWS Glue creates one partition directory per unique brand value. When you query for a specific brand, Athena reads only that partition.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Important: &lt;/strong&gt;Avoid identity partitioning on high-cardinality columns such as primary keys or timestamps. This creates many small partitions, which degrades both ingestion and query performance.&lt;/p&gt; 
&lt;h3&gt;Time-based partitioning&lt;/h3&gt; 
&lt;p&gt;Time-based partitioning organizes data by timestamp at a chosen granularity: &lt;code&gt;year&lt;/code&gt;, &lt;code&gt;month&lt;/code&gt;, &lt;code&gt;day&lt;/code&gt;, or &lt;code&gt;hour&lt;/code&gt;. You apply this strategy to time-series data and time-range queries. To partition the product table by month on the &lt;code&gt;created_at&lt;/code&gt; column, which stores epoch milliseconds, use this configuration:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-json"&gt;{
  "partitionSpec": [
    {
      "fieldName": "created_at",
      "functionSpec": "month",
      "conversionSpec": "epoch_milli"
    }
  ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;The &lt;code&gt;conversionSpec&lt;/code&gt; parameter tells AWS Glue how to interpret the source timestamp. Supported values: &lt;code&gt;epoch_sec&lt;/code&gt; (Unix seconds), &lt;code&gt;epoch_milli&lt;/code&gt; (Unix milliseconds), and &lt;code&gt;iso&lt;/code&gt; (ISO 8601 format).&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Note: &lt;/strong&gt;The original column values remain unchanged. AWS Glue transforms only the partition column values to timestamp type in the target table.&lt;/p&gt; 
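&lt;p&gt;To see why the &lt;code&gt;conversionSpec&lt;/code&gt; value matters, the following Python sketch derives a month bucket from a source value under each of the three supported interpretations. It is a rough model under stated assumptions: &lt;code&gt;to_utc_datetime&lt;/code&gt; and &lt;code&gt;month_partition&lt;/code&gt; are hypothetical helpers, the &lt;code&gt;YYYY-MM&lt;/code&gt; label is only an illustrative format, and treating ISO strings without an offset as UTC is an assumption of the sketch rather than documented AWS Glue behavior.&lt;/p&gt; 

```python
from datetime import datetime, timezone


def to_utc_datetime(value, conversion_spec):
    """Interpret a source value according to the given conversionSpec."""
    if conversion_spec == "epoch_sec":
        return datetime.fromtimestamp(value, tz=timezone.utc)
    if conversion_spec == "epoch_milli":
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    if conversion_spec == "iso":
        parsed = datetime.fromisoformat(value)
        if parsed.tzinfo is None:
            # Assumption of this sketch: naive timestamps are already UTC.
            return parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    raise ValueError(f"unsupported conversionSpec: {conversion_spec!r}")


def month_partition(value, conversion_spec):
    """Return an illustrative YYYY-MM label for the month bucket."""
    return to_utc_datetime(value, conversion_spec).strftime("%Y-%m")


# created_at from the sample item, stored as epoch milliseconds.
print(month_partition(1701388800000, "epoch_milli"))  # 2023-12

# An ISO timestamp with an offset lands in the UTC month, not the local one.
print(month_partition("2023-12-31T23:30:00-05:00", "iso"))  # 2024-01
```

&lt;p&gt;The second call shows a side effect of normalizing to UTC: a late-evening local timestamp on December 31 falls into the January bucket.&lt;/p&gt; 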
&lt;h3&gt;Multi-level partitioning&lt;/h3&gt; 
&lt;p&gt;You can combine strategies for a hierarchical scheme. To partition first by month and then by brand, use this configuration:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-json"&gt;{
  "partitionSpec": [
    {
      "fieldName": "created_at",
      "functionSpec": "month",
      "conversionSpec": "epoch_milli"
    },
    {
      "fieldName": "productdetails.brand",
      "functionSpec": "identity"
    }
  ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;This scheme supports efficient queries that filter by date range, brand, or both. Place higher-selectivity columns first in the hierarchy and align the scheme with your most common query patterns.&lt;/p&gt; 
&lt;h2&gt;Best practices&lt;/h2&gt; 
&lt;p&gt;Keep these guidelines in mind when you configure your integration:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Avoid identity partitioning on high-cardinality columns&lt;/strong&gt; such as primary keys, timestamps, or system-generated IDs. This leads to partition explosion and degrades performance.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Apply only one time-based function per column.&lt;/strong&gt; For example, don’t partition col1 by year, month, day, and hour simultaneously.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Match &lt;/strong&gt;&lt;code&gt;conversionSpec&lt;/code&gt;&lt;strong&gt; to your actual data format.&lt;/strong&gt; If your timestamps are in epoch milliseconds, use &lt;code&gt;epoch_milli&lt;/code&gt;, not &lt;code&gt;epoch_sec&lt;/code&gt; or &lt;code&gt;iso&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Choose granularity based on data volume.&lt;/strong&gt; High-volume tables benefit from finer granularity (day or hour). Lower-volume tables work well with coarser granularity (month or year).&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Account for timezone implications with ISO timestamps.&lt;/strong&gt; AWS Glue Zero-ETL normalizes timestamp partition values to UTC.&lt;/li&gt; 
&lt;/ul&gt; 
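&lt;p&gt;Some of these guidelines can be checked mechanically before you create the integration. The following Python sketch is one possible pre-flight check, not part of AWS Glue: &lt;code&gt;check_partition_spec&lt;/code&gt; is a hypothetical helper, and it knows only the functions and &lt;code&gt;conversionSpec&lt;/code&gt; values named in this post, so anything else is reported as not covered rather than as invalid.&lt;/p&gt; 

```python
TIME_FUNCTIONS = {"year", "month", "day", "hour"}
CONVERSION_SPECS = {"epoch_sec", "epoch_milli", "iso"}


def check_partition_spec(config):
    """Return a list of findings; an empty list means nothing was flagged."""
    findings = []
    time_columns = set()
    for entry in config.get("partitionSpec", []):
        field = entry.get("fieldName")
        function = entry.get("functionSpec")
        if function in TIME_FUNCTIONS:
            # Guideline: apply only one time-based function per column.
            if field in time_columns:
                findings.append(f"{field}: more than one time-based function")
            time_columns.add(field)
            conversion = entry.get("conversionSpec")
            # Only flag a conversionSpec that is present but unrecognized.
            if conversion is not None and conversion not in CONVERSION_SPECS:
                findings.append(f"{field}: unrecognized conversionSpec {conversion!r}")
        elif function != "identity":
            findings.append(f"{field}: functionSpec {function!r} not covered by this sketch")
    return findings


# The multi-level configuration shown earlier in this post.
multi_level = {
    "partitionSpec": [
        {"fieldName": "created_at", "functionSpec": "month", "conversionSpec": "epoch_milli"},
        {"fieldName": "productdetails.brand", "functionSpec": "identity"},
    ]
}
print(check_partition_spec(multi_level))
```

&lt;p&gt;The sketch cannot judge cardinality or data volume, because those depend on the contents of your table. Treat it as a complement to reviewing your data, not a replacement.&lt;/p&gt; 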
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;To implement the AWS Glue Zero-ETL integration with a DynamoDB source, you will need:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;An AWS account with IAM permissions that follow the principle of least privilege&lt;/li&gt; 
 &lt;li&gt;An AWS Glue database (for example, &lt;code&gt;ddb_zero_etl_demo_db&lt;/code&gt;) with an Amazon S3 bucket associated as the database location (&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/zero-etl-prerequisites.html#zero-etl-setup-target-resources-glue-database" target="_blank" rel="noopener noreferrer"&gt;setup instructions&lt;/a&gt;)&lt;/li&gt; 
 &lt;li&gt;AWS Glue Data Catalog settings updated with an AWS Identity and Access Management (IAM) policy that grants fine-grained access control for zero-ETL (&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/zero-etl-prerequisites.html#zero-etl-setup-target-resources-glue-database" target="_blank" rel="noopener noreferrer"&gt;setup instructions&lt;/a&gt;)&lt;/li&gt; 
 &lt;li&gt;An IAM role (for example, &lt;code&gt;zetl-role&lt;/code&gt;) that zero-ETL uses to access data from your DynamoDB table&lt;/li&gt; 
 &lt;li&gt;A DynamoDB source table (for example, &lt;code&gt;product&lt;/code&gt;) configured for zero-ETL integration (&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/zero-etl-sources.html#zero-etl-config-source-dynamodb" target="_blank" rel="noopener noreferrer"&gt;setup instructions&lt;/a&gt;)&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Walkthrough: Create the zero-ETL integration&lt;/h2&gt; 
&lt;p&gt;Complete these steps to create a zero-ETL integration with DynamoDB as the source and Apache Iceberg tables in Amazon S3 as the target.&lt;/p&gt; 
&lt;h3&gt;Step 1: Select the source type&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Open the AWS Glue console.&lt;/li&gt; 
 &lt;li&gt;In the navigation pane, under &lt;strong&gt;Data Integration and ETL&lt;/strong&gt;, choose &lt;strong&gt;Zero-ETL integrations&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Create zero-ETL integration&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Select &lt;strong&gt;Amazon DynamoDB&lt;/strong&gt; as the source type, then choose &lt;strong&gt;Next&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-2.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90948" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-2.png" alt="AWS Glue console showing Step 1 of creating a Zero-ETL integration — selecting a source type from 14 available data sources including Amazon DynamoDB, Facebook Ads, Instagram Ads, MySQL, Oracle, PostgreSQL, and Microsoft SQL Server" width="3432" height="1648"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 1: Selecting Amazon DynamoDB as the zero-ETL source type]&lt;/em&gt;&lt;/p&gt; 
&lt;h3&gt;Step 2: Configure source and target&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;In &lt;strong&gt;Source details&lt;/strong&gt;, select your DynamoDB table (for example, &lt;code&gt;product&lt;/code&gt;).&lt;/li&gt; 
 &lt;li&gt;In &lt;strong&gt;Target details&lt;/strong&gt;:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;ul&gt; 
 &lt;li style="list-style-type: none"&gt; 
  &lt;ul&gt; 
   &lt;li&gt;Select the current account as target.&lt;/li&gt; 
   &lt;li&gt;Choose the catalog and target database (for example, &lt;code&gt;ddb_zero_etl_demo_db&lt;/code&gt;).&lt;/li&gt; 
   &lt;li&gt;Select the IAM role (for example, &lt;code&gt;zetl-role&lt;/code&gt;).&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-3.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90949" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-3.png" alt="AWS Glue console Step 2 — configuring source and target for a zero-ETL integration with Amazon DynamoDB &amp;quot;product&amp;quot; table as source and an AWS Glue catalog database &amp;quot;ddb_zero_etl_demo_db&amp;quot; as target" width="3234" height="1622"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 2: Configuring source DynamoDB table and target database]&lt;/em&gt;&lt;/p&gt; 
&lt;h3&gt;Step 3: Configure output settings&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Under &lt;strong&gt;Schema unnesting&lt;/strong&gt;, select &lt;strong&gt;Unnest all fields&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Under &lt;strong&gt;Data partitioning&lt;/strong&gt;, select &lt;strong&gt;Specify custom partition keys&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Enter the partition key (for example, &lt;code&gt;productdetails.brand&lt;/code&gt;) and set the function to &lt;strong&gt;Identity&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Next&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-4.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90950" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-4.png" alt="AWS Glue Zero-ETL integration output settings showing schema unnesting set to &amp;quot;Unnest all fields,&amp;quot; custom partition key &amp;quot;productdetails.brand&amp;quot; configured with Identity function, and target table named &amp;quot;product.&amp;quot;" width="3404" height="1664"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 3: Configuring schema unnesting and partition key settings]&lt;/em&gt;&lt;/p&gt; 
&lt;h3&gt;Step 4: Set integration details&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Optionally configure encryption and replication settings. The default refresh interval is 15 minutes.&lt;/li&gt; 
 &lt;li&gt;Enter a name for the integration (for example, &lt;code&gt;ddb-zero-etl-demo&lt;/code&gt;).&lt;/li&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Next&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-5.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90951" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-5.png" alt="AWS Glue Zero-ETL integration Step 3 — configuring security with AWS managed KMS key, replication refresh interval set to 15 minutes, and integration named &amp;quot;ddb-zero-etl-demo&amp;quot;" width="3422" height="1670"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 4: Configuring encryption and replication settings]&lt;/em&gt;&lt;/p&gt; 
&lt;h3&gt;Step 5: Review and create&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Review your settings and choose &lt;strong&gt;Create and launch integration&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;The integration shows as &lt;strong&gt;Active&lt;/strong&gt; within about a minute.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-6.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90952" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-6.png" alt="AWS Glue Zero-ETL integration Step 4: Review and Create — showing DynamoDB &amp;quot;product&amp;quot; table as source, Glue database &amp;quot;zett_target&amp;quot; as target with IAM role &amp;quot;zett-role,&amp;quot; and partition key &amp;quot;productdetails.brand&amp;quot; with Identity function" width="3440" height="1682"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 5: Review and create summary]&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-7.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90953" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-7.png" alt="AWS Glue Zero-ETL Integration Details page showing &amp;quot;ddb-zero-etl-demo-test&amp;quot; integration with status &amp;quot;Creating,&amp;quot; DynamoDB &amp;quot;product&amp;quot; table as source, Glue database &amp;quot;ddb_zero_etl_demo_db&amp;quot; as target, and a 15-minute refresh interval" width="3428" height="1554"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 6: Integration active with successful status]&lt;/em&gt;&lt;/p&gt; 
&lt;h2&gt;Query the replicated data&lt;/h2&gt; 
&lt;p&gt;After the integration is active and the initial replication completes (typically 15–30 minutes), you can query the data in Amazon Athena.&lt;/p&gt; 
&lt;h3&gt;Preview the replicated data&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Open the Amazon Athena console.&lt;/li&gt; 
 &lt;li&gt;In the query editor, select your target database (for example, &lt;code&gt;ddb_zero_etl_demo_db&lt;/code&gt;).&lt;/li&gt; 
 &lt;li&gt;Run a preview query:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;pre&gt;&lt;code class="lang-sql"&gt;SELECT * FROM "ddb_zero_etl_demo_db"."product" LIMIT 10;&lt;/code&gt;&lt;/pre&gt; 
&lt;h3&gt;Verify schema unnesting&lt;/h3&gt; 
&lt;p&gt;With &lt;strong&gt;Unnest all fields&lt;/strong&gt; selected, nested attributes appear as individual columns with dot notation:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-sql"&gt;SELECT "productdetails.brand", "productdetails.category", "pricing.list_price" 
FROM "ddb_zero_etl_demo_db"."product"
WHERE "productdetails.category" = 'Electronics';&lt;/code&gt;&lt;/pre&gt; 
&lt;h3&gt;Verify partition pruning&lt;/h3&gt; 
&lt;p&gt;Queries that filter on the partition column (&lt;code&gt;productdetails.brand&lt;/code&gt;) automatically skip irrelevant partitions:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-sql"&gt;SELECT product_id, name, "pricing.list_price"
FROM "ddb_zero_etl_demo_db"."product"
WHERE "productdetails.brand" = 'AudioTech';&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-8.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90954" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-8.png" alt="Amazon Athena Query Editor showing a completed SQL query selecting brand, category, and product ID from a DynamoDB zero-ETL Glue catalog table, returning two results: Samsung SmartPhone P22445 and TechCo SmartPhone P12345" width="3392" height="1622"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 7: Athena query to retrieve the data from the Apache Iceberg lakehouse]&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;You can verify the partition structure by navigating to the Amazon S3 bucket associated with your database. The data is organized into directories such as the following:&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-9.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90955" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-9.png" alt="Amazon S3 bucket browser showing the &amp;quot;data/&amp;quot; folder in &amp;quot;ddb-zero-etl-demo-bucket&amp;quot; with two partitioned folders: &amp;quot;productdetails.brand=Samsung/&amp;quot; and &amp;quot;productdetails.brand=TechCo/&amp;quot; — confirming Iceberg partition structure from DynamoDB zero-ETL integration" width="3438" height="1404"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 8: Amazon S3 bucket organization for the identity partition productdetails.brand]&lt;/em&gt;&lt;/p&gt; 
&lt;h2&gt;Clean up&lt;/h2&gt; 
&lt;p&gt;To avoid ongoing charges, delete the resources in this order:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Delete the zero-ETL integration.&lt;/strong&gt; In the AWS Glue console, navigate to &lt;strong&gt;Zero-ETL integrations&lt;/strong&gt;, select your integration, and choose &lt;strong&gt;Delete&lt;/strong&gt;. Existing replicated data remains in the target, but new changes stop replicating.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Delete the replicated table.&lt;/strong&gt; In the AWS Glue Data Catalog, navigate to &lt;strong&gt;Tables&lt;/strong&gt;, select the replicated table, and delete it.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Delete the AWS Glue database.&lt;/strong&gt; In the Data Catalog, select the database and delete it.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Delete the Amazon S3 data.&lt;/strong&gt; Empty and delete the S3 bucket associated with the database.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Delete the DynamoDB table.&lt;/strong&gt; If you created it for this walkthrough, delete the source table.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Delete IAM resources.&lt;/strong&gt; Remove the IAM role and policies created for the integration.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;You configured schema unnesting and data partitioning for a DynamoDB zero-ETL integration, replicated a product catalog table to Apache Iceberg tables in Amazon S3, and verified the results in Amazon Athena. Unnesting flattened nested attributes into directly queryable columns. Partitioning helped the query engine skip irrelevant data, reducing both query time and cost. To take your integration further, try monitoring replication lag and data freshness with Amazon CloudWatch metrics. You can also experiment with different partitioning strategies on a staging table before applying them to production workloads, testing time-based partitioning alongside identity partitioning to find the optimal scheme for your query patterns. For broader analytics coverage, query the same Iceberg tables from Amazon Redshift Spectrum or Amazon EMR alongside Athena. For more details, explore these resources:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/zero-etl-using.html" target="_blank" rel="noopener noreferrer"&gt;AWS Glue Zero-ETL integrations&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/zero-etl-monitoring.html" target="_blank" rel="noopener noreferrer"&gt;Monitoring zero-ETL integrations&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/athena/latest/ug/what-is.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Athena documentation&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html" target="_blank" rel="noopener noreferrer"&gt;Amazon DynamoDB Developer Guide&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/introduction.html" target="_blank" rel="noopener noreferrer"&gt;Apache Iceberg on AWS&lt;/a&gt;&lt;/li&gt; 
&lt;/ul&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-10.png"&gt;&lt;img loading="lazy" class="alignleft size-full wp-image-90956" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-10.png" alt="" width="100" height="132"&gt;&lt;/a&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Raju Ansari&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/rajuansari/"&gt;Raju&lt;/a&gt; is a Senior Software Development Engineer at AWS, specializing in building scalable, secure, serverless solutions that simplify data analytics and AI agent development. He helps organizations modernize their data analytics infrastructure and develop cutting-edge AI agentic applications. Currently, Raju focuses on building foundational AI services, including Amazon Bedrock Agents, which enable developers to create intelligent, autonomous applications at scale. Outside of work, Raju is passionate about giving back to the tech community. He actively volunteers at IEEE events and mentors early and mid-career professionals.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-11.jpeg"&gt;&lt;img loading="lazy" class="alignleft size-full wp-image-90957" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-11.jpeg" alt="" width="100" height="133"&gt;&lt;/a&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Shashank Sharma&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/shashankkumarsharma/"&gt;Shashank&lt;/a&gt; is an Engineering Leader with over 15 years of experience delivering data integration and replication solutions for first-party and third-party databases and SaaS for enterprise customers. He leads engineering for AWS Glue Zero-ETL and Amazon AppFlow, building fully managed pipelines that replicate data from sources like Salesforce, SAP, DynamoDB, and Oracle into Amazon Redshift and Apache Iceberg-based data lakes. Shashank advises startups on technology strategy and mentors engineers and technical leaders at various career stages.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>How to build a cross-Region resilience for Amazon OpenSearch Service with Amazon MSK</title>
		<link>https://aws.amazon.com/blogs/big-data/how-to-build-a-cross-region-resilience-for-amazon-opensearch-service-with-amazon-msk/</link>
					
		
		<dc:creator><![CDATA[Sriharsha Subramanya Begolli]]></dc:creator>
		<pubDate>Mon, 11 May 2026 18:46:43 +0000</pubDate>
				<category><![CDATA[Amazon Managed Streaming for Apache Kafka (Amazon MSK)]]></category>
		<category><![CDATA[Amazon OpenSearch Ingestion]]></category>
		<category><![CDATA[Amazon OpenSearch Serverless]]></category>
		<category><![CDATA[Industries]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">324ae3147cd5c4fdaa603ee2a1e41ea1e7599e94</guid>

					<description>In this post, we outline the solution that provides cross-Region resiliency without needing to reestablish relationships during a fail-back, using an active-active replication model with Amazon OpenSearch Ingestion (OSI) and Amazon Managed Streaming for Apache Kafka (Amazon MSK). This solution applies to both OpenSearch Service managed clusters and Amazon OpenSearch Serverless collections. We use Amazon OpenSearch Serverless as an example for the configurations in this post.</description>
										<content:encoded>&lt;p&gt;Cross-Region resilience for &lt;a href="https://aws.amazon.com/opensearch-service/" target="_blank" rel="noopener noreferrer"&gt;Amazon OpenSearch Service&lt;/a&gt; has historically been a complex challenge, relying on &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managedomains-snapshot-create.html" target="_blank" rel="noopener noreferrer"&gt;S3-based snapshots&lt;/a&gt; or &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/replication.html" target="_blank" rel="noopener noreferrer"&gt;cross-cluster replication&lt;/a&gt; that demand intricate manual failover procedures often resulting in hours of downtime, data inconsistencies, and significant lag during outages, or other operational disruptions. To overcome these limitations and help businesses stay focused on their core objectives, we’ve developed a solution that automatically maintains synchronized data across AWS Regions while supporting active-active operations in both AWS Regions.&lt;/p&gt; 
&lt;p&gt;AWS provides two &lt;a href="https://opensearch.org/" target="_blank" rel="noopener noreferrer"&gt;OpenSearch&lt;/a&gt; offerings: &lt;a href="https://aws.amazon.com/opensearch-service/" target="_blank" rel="noopener noreferrer"&gt;Amazon OpenSearch Service&lt;/a&gt;, a managed cluster-based service where you provision and manage OpenSearch domains (nodes, storage, scaling), and &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless.html" target="_blank" rel="noopener noreferrer"&gt;Amazon OpenSearch Serverless&lt;/a&gt;, a serverless option where AWS automatically manages infrastructure and scaling and you create collections for your search or analytics workloads. OpenSearch Service provides high availability (HA) within an AWS Region through its Multi-AZ deployment model and provides Regional resiliency with &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/replication.html" target="_blank" rel="noopener noreferrer"&gt;cross-cluster replication&lt;/a&gt;. &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/msk-replicator.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Streaming for Apache Kafka (Amazon MSK) Replicator&lt;/a&gt; is an Amazon MSK feature that you can use to reliably replicate data across Amazon MSK clusters in the same or different AWS Regions.&lt;/p&gt; 
&lt;p&gt;In this post, we outline the solution that provides cross-Region resiliency without needing to reestablish relationships during a fail-back, using an &lt;a href="https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iv-multi-site-active-active/" target="_blank" rel="noopener noreferrer"&gt;active-active replication model&lt;/a&gt; with &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ingestion.html" target="_blank" rel="noopener noreferrer"&gt;Amazon OpenSearch Ingestion (OSI)&lt;/a&gt; and &lt;a href="https://aws.amazon.com/msk/" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Streaming for Apache Kafka (Amazon MSK).&lt;/a&gt; This solution applies to both OpenSearch Service managed clusters and Amazon OpenSearch Serverless collections. We use Amazon OpenSearch Serverless as an example for the configurations in this post.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Solution overview&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;In this solution we use Amazon MSK Replicator for bidirectional cross-Region data replication, with OSI pipelines to index data into Amazon OpenSearch Serverless collections in each AWS Region. While the &lt;a href="https://aws.amazon.com/blogs/big-data/achieve-cross-region-resilience-with-amazon-opensearch-ingestion/" target="_blank" rel="noopener noreferrer"&gt;S3 based approach&lt;/a&gt; serves the purpose, Amazon MSK Replicator provides near real-time replication with identical topic naming, which supports active-active operations. Amazon MSK Replicator provides automatic loop prevention and consumer group offset synchronization, enabling seamless cross-Region failover. You can find the code for the entire solution in the GitHub &lt;a href="https://github.com/aws-samples/sample-opensearch-cross-region-resilience-with-msk/tree/main" target="_blank" rel="noopener noreferrer"&gt;repo&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/bdb-5738-image-1.png" alt="" width="1085" height="599"&gt;The architecture follows a Regional-first approach where data sources write to a local Amazon MSK cluster within their AWS Region. In this sample deployment, an &lt;a href="https://aws.amazon.com/lambda/" target="_blank" rel="noopener noreferrer"&gt;AWS Lambda&lt;/a&gt; function serves as the producer, streaming data into the MSK cluster. OSI pipelines consume the incoming data from the local MSK cluster and persist it to an Amazon OpenSearch Serverless collection within the same AWS Region. To achieve cross-Region data synchronization, Amazon MSK Replicator facilitates bidirectional replication between the Amazon MSK clusters, preserving the same topic names across both environments. With this design, the Amazon OpenSearch Serverless collections in each AWS Region maintain identical datasets, which provides low-latency search capabilities and high availability for globally distributed workloads.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Deploy the AWS &lt;a href="https://github.com/aws-samples/sample-opensearch-cross-region-resilience-with-msk/tree/main/cloudformation" target="_blank" rel="noopener noreferrer"&gt;CloudFormation template&lt;/a&gt; to install the prerequisites. The solution has the following prerequisite steps:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Set up &lt;/strong&gt;&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;Amazon Virtual Private Cloud (Amazon VPC)&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt; infrastructure in both Regions&lt;/strong&gt; 
  &lt;ol type="a"&gt; 
   &lt;li&gt;Create Amazon VPCs with private subnets in at least two or three Availability Zones for high availability at the AWS Region level&lt;/li&gt; 
   &lt;li&gt;Configure &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html" target="_blank" rel="noopener noreferrer"&gt;Network Address Translation (NAT) Gateways&lt;/a&gt; for outbound internet access from &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/configure-subnets.html" target="_blank" rel="noopener noreferrer"&gt;private subnets&lt;/a&gt;&lt;/li&gt; 
   &lt;li&gt;Use &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/subnet-sizing.html" target="_blank" rel="noopener noreferrer"&gt;non-overlapping CIDR blocks&lt;/a&gt;&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Establish Amazon OpenSearch Serverless collections in both AWS Regions&lt;/strong&gt; 
  &lt;ol type="a"&gt; 
   &lt;li&gt;Create Amazon OpenSearch Serverless collections for log analytics&lt;/li&gt; 
   &lt;li&gt;Configure &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-encryption.html" target="_blank" rel="noopener noreferrer"&gt;encryption&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-network.html" target="_blank" rel="noopener noreferrer"&gt;network&lt;/a&gt;, and &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html" target="_blank" rel="noopener noreferrer"&gt;data access policies&lt;/a&gt;&lt;/li&gt; 
   &lt;li&gt;Create &lt;a href="https://docs.aws.amazon.com/vpc/latest/privatelink/create-interface-endpoint.html" target="_blank" rel="noopener noreferrer"&gt;Amazon VPC endpoints&lt;/a&gt; for private access&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Configure MSK clusters in both AWS Regions&lt;/strong&gt; 
  &lt;ol type="a"&gt; 
   &lt;li&gt;Enable &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/security-iam.html" target="_blank" rel="noopener noreferrer"&gt;AWS Identity and Access Management (IAM) authentication (SASL/IAM)&lt;/a&gt;&lt;/li&gt; 
   &lt;li&gt;Enable &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/mvpc-getting-started.html" target="_blank" rel="noopener noreferrer"&gt;Multi-VPC connectivity&lt;/a&gt; (required for Amazon MSK Replicator and OSI)&lt;/li&gt; 
   &lt;li&gt;Configure &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/security_iam_service-with-iam.html" target="_blank" rel="noopener noreferrer"&gt;MSK cluster policies&lt;/a&gt; to allow the &lt;code&gt;kafka.amazonaws.com&lt;/code&gt; and &lt;code&gt;osis-pipelines.amazonaws.com&lt;/code&gt; service principals&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Configure IAM permissions for pipeline and replication access&lt;/strong&gt; 
  &lt;ol type="a"&gt; 
   &lt;li&gt;Create &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/security-iam-ingestion.html" target="_blank" rel="noopener noreferrer"&gt;IAM roles for the OSI pipelines&lt;/a&gt; with permissions to access Amazon Managed Streaming for Apache Kafka and Amazon OpenSearch Serverless&lt;/li&gt; 
   &lt;li&gt;Create &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/security-iam-awsmanpol-AWSMSKReplicatorExecutionRole.html" target="_blank" rel="noopener noreferrer"&gt;IAM roles for the Amazon MSK Replicator&lt;/a&gt; with permissions for cross-Region access to Amazon Managed Streaming for Apache Kafka clusters&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://github.com/aws-samples/sample-opensearch-cross-region-resilience-with-msk/tree/main/cloudformation" target="_blank" rel="noopener noreferrer"&gt;This AWS CloudFormation&lt;/a&gt; template deploys all of the required configurations, with &lt;code&gt;us-east-1&lt;/code&gt; as the primary AWS Region and &lt;code&gt;us-west-2&lt;/code&gt; as the secondary AWS Region.&lt;/p&gt; 
&lt;p&gt;The following snippet shows the configuration for the OSI pipeline, which writes data from Amazon MSK to Amazon OpenSearch Serverless. The OSI pipeline uses MSK as a source with IAM authentication.&lt;/p&gt; 
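&lt;p&gt;For illustration, the MSK cluster policy listed in the prerequisites might look like the following CloudFormation fragment. Treat it as a minimal sketch rather than the template from the sample repository: the logical ID &lt;code&gt;PrimaryMskClusterPolicy&lt;/code&gt; is hypothetical, and the listed &lt;code&gt;kafka:&lt;/code&gt; actions are an assumption about what the two service principals need, so compare them against the repository and the linked Amazon MSK documentation before you use them.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre class="unlimited-height-code"&gt;&lt;code class="lang-yaml"&gt;# Hypothetical sketch: cluster policy for the primary MSK cluster
Resources:
  PrimaryMskClusterPolicy:
    Type: AWS::MSK::ClusterPolicy
    Properties:
      ClusterArn: "arn:aws:kafka:us-east-1:&amp;lt;aws-account-id&amp;gt;:cluster/production-msk-primary/CLUSTER_ID"
      Policy:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service:
                # Service principals named in the prerequisites
                - kafka.amazonaws.com
                - osis-pipelines.amazonaws.com
            # Assumed action list; verify against the sample repository
            Action:
              - "kafka:CreateVpcConnection"
              - "kafka:GetBootstrapBrokers"
              - "kafka:DescribeClusterV2"
            Resource: "arn:aws:kafka:us-east-1:&amp;lt;aws-account-id&amp;gt;:cluster/production-msk-primary/CLUSTER_ID"&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The cluster in the secondary AWS Region would need an equivalent policy that references its own cluster ARN.&lt;/p&gt; 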
&lt;div class="hide-language"&gt; 
 &lt;pre class="unlimited-height-code"&gt;&lt;code class="lang-yaml"&gt;version: "2"
kafka-pipeline:
  source:
    kafka:
      # End-to-end acknowledgments: offsets are committed after events reach the sink
      acknowledgments: true
      topics:
        - name: "opensearch-data"
          group_id: "osi-consumer-group-primary"
      aws:
        msk:
          arn: "arn:aws:kafka:us-east-1:&amp;lt;aws-account-id&amp;gt;:cluster/production-msk-primary/CLUSTER_ID"
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::&amp;lt;aws-account-id&amp;gt;:role/production-osi-pipeline-primary-role"
  sink:
    - opensearch:
        hosts:
          - "https://&amp;lt;OPENSEARCH_SERVERLESS_COLLECTION_ID&amp;gt;.us-east-1.aoss.amazonaws.com"
        index: "application-logs-${yyyy.MM.dd}"
        aws:
          serverless: true
          region: "us-east-1"
          sts_role_arn: "arn:aws:iam::&amp;lt;aws-account-id&amp;gt;:role/production-osi-pipeline-primary-role"
        # Documents that fail to index are written to this S3 dead-letter queue
        dlq:
          s3:
            bucket: "production-opensearch-dlq-us-east-1"
            region: "us-east-1"
            sts_role_arn: "arn:aws:iam::&amp;lt;aws-account-id&amp;gt;:role/production-osi-pipeline-primary-role"&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The &lt;a href="https://github.com/aws-samples/sample-opensearch-cross-region-resilience-with-msk/blob/main/docs/iamrole-replicator-config.md" target="_blank" rel="noopener noreferrer"&gt;OSI pipeline IAM role&lt;/a&gt; has the permissions required for Amazon MSK and Amazon OpenSearch Serverless to consume message data from the source and write data to the destination. For true active-active replication, the sample deploys &lt;a href="https://github.com/aws-samples/sample-opensearch-cross-region-resilience-with-msk/blob/main/docs/iamrole-replicator-config.md" target="_blank" rel="noopener noreferrer"&gt;two Amazon MSK Replicators&lt;/a&gt;, one in each AWS Region. Each Amazon MSK cluster requires a &lt;a href="https://github.com/aws-samples/sample-opensearch-cross-region-resilience-with-msk/blob/main/docs/iamrole-replicator-config.md" target="_blank" rel="noopener noreferrer"&gt;cluster policy&lt;/a&gt; to allow Amazon MSK Replicator and OSI to connect. To validate the bidirectional replication, the solution uses AWS Lambda functions to produce test messages to both Amazon MSK clusters.&lt;/p&gt; 
&lt;p&gt;When an application generates an event, it first publishes the message to an Apache Kafka topic in the Regional streaming cluster powered by Amazon Managed Streaming for Apache Kafka. In this sample deployment, an AWS Lambda function simulates application activity by producing events into the topic. These events are durably stored in the Apache Kafka partitions, providing a reliable buffer between producers and downstream consumers. An ingestion pipeline built using Amazon OpenSearch Ingestion continuously reads the event stream from the Apache Kafka topic and prepares the data for indexing. The pipeline then indexes the processed events into a collection in Amazon OpenSearch Serverless, making the data searchable in near real time.&lt;/p&gt; 
&lt;p&gt;At the same time, Amazon MSK Replicator replicates the Apache Kafka topic to a peer Amazon MSK cluster in a secondary AWS Region while preserving the topic structure. This makes the same event stream available in the secondary AWS Region without requiring changes to downstream consumers. An OpenSearch Ingestion pipeline in the secondary AWS Region consumes the replicated topic and indexes the events into its local OpenSearch Serverless collection. As events continue to flow through the system, both AWS Regions maintain synchronized datasets that can be queried independently. This architecture enables low-latency Regional search while maintaining a resilient, cross-Region copy of the indexed data.&lt;/p&gt; 
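&lt;p&gt;As a minimal sketch, the pipeline in the secondary AWS Region could mirror the primary configuration with only the Region-specific values changed. The consumer group, cluster, role, and bucket names ending in &lt;code&gt;secondary&lt;/code&gt; or &lt;code&gt;us-west-2&lt;/code&gt; are assumptions modeled on the primary names, not values taken from the sample repository. The topic name stays the same because Amazon MSK Replicator preserves topic names across Regions.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre class="unlimited-height-code"&gt;&lt;code class="lang-yaml"&gt;version: "2"
kafka-pipeline:
  source:
    kafka:
      acknowledgments: true
      topics:
        # Same topic name as in the primary Region
        - name: "opensearch-data"
          group_id: "osi-consumer-group-secondary"   # hypothetical name
      aws:
        msk:
          # Hypothetical secondary cluster ARN
          arn: "arn:aws:kafka:us-west-2:&amp;lt;aws-account-id&amp;gt;:cluster/production-msk-secondary/CLUSTER_ID"
        region: "us-west-2"
        sts_role_arn: "arn:aws:iam::&amp;lt;aws-account-id&amp;gt;:role/production-osi-pipeline-secondary-role"
  sink:
    - opensearch:
        hosts:
          - "https://&amp;lt;OPENSEARCH_SERVERLESS_COLLECTION_ID&amp;gt;.us-west-2.aoss.amazonaws.com"
        # Same index pattern as the primary pipeline so both collections match
        index: "application-logs-${yyyy.MM.dd}"
        aws:
          serverless: true
          region: "us-west-2"
          sts_role_arn: "arn:aws:iam::&amp;lt;aws-account-id&amp;gt;:role/production-osi-pipeline-secondary-role"
        dlq:
          s3:
            bucket: "production-opensearch-dlq-us-west-2"   # hypothetical name
            region: "us-west-2"
            sts_role_arn: "arn:aws:iam::&amp;lt;aws-account-id&amp;gt;:role/production-osi-pipeline-secondary-role"&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 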
&lt;h2&gt;&lt;strong&gt;Failover scenario and considerations&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;You can fail over your application to the Amazon OpenSearch Serverless collection in the other AWS Region and continue operations without interruption. The data present before the impairment is available in both collections. Upon recovery, Amazon MSK Replicator and OSI pipelines automatically resume operations without manual intervention. Data that you write to the healthy AWS Region during the impairment is automatically backfilled to the recovered AWS Region. For detailed step-by-step guidance, see the &lt;a href="https://github.com/aws-samples/sample-opensearch-cross-region-resilience-with-msk/blob/main/docs/disaster-recovery-testing.md" target="_blank" rel="noopener noreferrer"&gt;disaster recovery section in the GitHub repo&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;When using Amazon MSK Replicator, be aware that cross-Region data transfer incurs additional costs. To help verify reliability, configure &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/osis-features-overview.html#osis-features-dlq" target="_blank" rel="noopener noreferrer"&gt;Dead Letter Queues (DLQ) for OSI pipelines&lt;/a&gt; to capture failed document ingestion. Additionally, monitor essential &lt;a href="https://aws.amazon.com/cloudwatch/" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch&lt;/a&gt; metrics including ReplicationLatency for tracking lag between clusters, DocumentsFailed for identifying ingestion issues, and MessagesInPerSec for observing message throughput.&lt;/p&gt; 
&lt;p&gt;Persistent buffering in OSI provides a built-in safety net that prevents data loss when data producers send information faster than your OpenSearch cluster can process it, removing the need to provision and manage separate buffering infrastructure. By using managed storage across multiple Availability Zones, this feature enhances data durability while dynamically allocating &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-scaling.html" target="_blank" rel="noopener noreferrer"&gt;OpenSearch Compute Units (OCUs)&lt;/a&gt; for both buffering and data processing, which incurs additional costs. Persistent buffering isn’t enabled by default. Without it, the OSI pipeline relies on an in-memory buffer, which is volatile and has limited capacity for storing incoming data before processing.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;In this post, we showed you how to achieve cross-Region resiliency for Amazon OpenSearch Serverless and OpenSearch Service managed clusters. In our experiments, most writes of a few KB of data completed within one to a few seconds between the two chosen AWS Regions. Replication lag between the AWS Regions depends on the network delay between the chosen Regions and the settings configured on the Amazon OpenSearch Ingestion (OSI) pipeline.&lt;/p&gt; 
&lt;p&gt;Refer to &lt;a href="https://aws.amazon.com/legal/service-level-agreements/" target="_blank" rel="noopener noreferrer"&gt;AWS Service Level Agreements (SLAs)&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ingestion.html" target="_blank" rel="noopener noreferrer"&gt;Amazon OpenSearch Ingestion&lt;/a&gt; (OSI) for more details. You can also achieve active-passive replication for OpenSearch using OSI and Amazon Simple Storage Service (Amazon S3), as described in the post &lt;a href="https://aws.amazon.com/blogs/big-data/achieve-cross-region-resilience-with-amazon-opensearch-ingestion/" target="_blank" rel="noopener noreferrer"&gt;Achieve cross-Region resilience with Amazon OpenSearch Ingestion&lt;/a&gt;.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft wp-image-90798 size-thumbnail" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/bdb-5738-image-2-100x98.png" alt="" width="100" height="98"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Sriharsha Subramanya Begolli&lt;/strong&gt; works as a Senior Solutions Architect with AWS, based in Bengaluru, India. His primary focus is assisting large enterprise customers in modernising their applications and developing cloud-based systems to meet their business objectives. His expertise lies in the domains of data, analytics and generative AI.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft wp-image-90799 size-thumbnail" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/bdb-5738-image-3-100x133.png" alt="" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Qais Poonawala&lt;/strong&gt; is a Senior Technical Account Manager at AWS Enterprise Support, India, who specializes in Cloud Operations and Security while helping customers architect highly scalable, resilient, and secure solutions. With extensive experience in enabling enterprise customers across AWS services, he has a passion for solving complex challenges and developing solutions around Security, Cloud Operations, and GenAI.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft wp-image-90800 size-thumbnail" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/bdb-5738-image-4-100x119.png" alt="" width="100" height="119"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Jay Jothi &lt;/strong&gt;is a Senior Technical Account Manager based in Chennai, India, where he supports major enterprise customers in maximizing the benefits of cloud technology. With extensive experience in the financial services industry and a specialization in Cloud Operations, he focuses on helping financial clients manage data efficiently, derive actionable insights using GenAI, and deliver cost-effective solutions.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>How to consolidate cross-Region S3 data into OpenSearch</title>
		<link>https://aws.amazon.com/blogs/big-data/how-to-consolidate-cross-region-s3-data-into-opensearch/</link>
					
		
		<dc:creator><![CDATA[David Venable]]></dc:creator>
		<pubDate>Fri, 08 May 2026 13:37:47 +0000</pubDate>
				<category><![CDATA[Amazon OpenSearch Ingestion]]></category>
		<category><![CDATA[Amazon OpenSearch Service]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">050f3f71ba06a80f03ca8dc732f00f739ae8f70f</guid>

					<description>We’re happy to announce that Amazon OpenSearch Ingestion pipelines can now read from S3 buckets in different Regions to ingest and consolidate data into a single OpenSearch Service domain or collection. In this post, I'll show you how to use the new cross-Region support to ingest data from S3 buckets across multiple AWS Regions into a single OpenSearch Service domain or collection.</description>
										<content:encoded>&lt;p&gt;You might have data in &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener"&gt;Amazon Simple Storage Service&lt;/a&gt; (Amazon S3) buckets in different AWS Regions that you want available in a single &lt;a href="https://aws.amazon.com/opensearch-service/" target="_blank" rel="noopener"&gt;Amazon OpenSearch Service&lt;/a&gt; domain or collection. Consolidating data across Regions provides unified analytics and search, reduces operational complexity, and streamlines your search infrastructure. We’re happy to announce that Amazon OpenSearch Ingestion pipelines can now read from S3 buckets in different Regions to ingest and consolidate data into a single OpenSearch Service domain or collection.&lt;/p&gt; 
&lt;p&gt;To consolidate this data across AWS Regions, you previously had to provide your own solution. Now Amazon OpenSearch Ingestion can help you accomplish this. In this post, I’ll show you how to use the new cross-Region support to ingest data from S3 buckets across multiple AWS Regions into a single OpenSearch Service domain or collection.&lt;/p&gt; 
&lt;p&gt;Amazon OpenSearch Ingestion (OSI) is a feature-rich data ingestion pipeline that you can use for many different purposes: observability, analytics, and zero-ETL search. Many customers use OpenSearch Ingestion to ingest data from Amazon S3 into OpenSearch Service domains and Amazon OpenSearch Serverless collections. Until now, you could only ingest from a single AWS Region at a time. Now that you can use OpenSearch Ingestion for cross-Region S3 ingestion, I’ll show you how you can use it in two scenarios: batch processing using S3 scan, and streaming ingestion using Amazon Simple Queue Service (Amazon SQS) queues for AWS vended logs like Amazon Virtual Private Cloud (Amazon VPC) Flow Logs and AWS CloudTrail.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Complete the following prerequisite steps:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/createupdatedomains.html" target="_blank" rel="noopener noreferrer"&gt;Deploy an OpenSearch Service domain&lt;/a&gt; or &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-collections.html" target="_blank" rel="noopener noreferrer"&gt;OpenSearch Serverless collection&lt;/a&gt; in the Region where you want to perform your search or analytics.&lt;/li&gt; 
 &lt;li&gt;You need S3 buckets in at least two different Regions. You can use existing ones or &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html" target="_blank" rel="noopener noreferrer"&gt;create S3 buckets&lt;/a&gt;. You can use one in the same AWS Region as your OpenSearch Service domain or collection, or use two completely different Regions.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html" target="_blank" rel="noopener noreferrer"&gt;Upload objects&lt;/a&gt; with data into your S3 buckets. The data can be JSON, ND-JSON, Parquet, CSV, or plaintext formats.&lt;/li&gt; 
 &lt;li&gt;Configure &lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener noreferrer"&gt;AWS Identity and Access Management&lt;/a&gt; (IAM) permissions needed for OSI. For instructions, see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/configure-client-s3.html#s3-source" target="_blank" rel="noopener noreferrer"&gt;Amazon S3 as a source&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;For cross-Region ingestion, you must now also include the s3:GetBucketLocation permission. This gives the pipeline the ability to determine which AWS Region the bucket is located in.&lt;/li&gt; 
&lt;/ol&gt; 
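&lt;p&gt;As a hedged sketch, the S3 permissions for the pipeline role could be expressed as the following CloudFormation-style policy document. The bucket names reuse the &lt;code&gt;amzn-s3-demo-bucket1&lt;/code&gt; and &lt;code&gt;amzn-s3-demo-bucket2&lt;/code&gt; placeholders from the pipeline example in this post. The &lt;code&gt;s3:GetObject&lt;/code&gt; and &lt;code&gt;s3:ListBucket&lt;/code&gt; actions are assumptions about what an S3 scan typically needs; follow the linked instructions for the authoritative list.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Hypothetical policy document for the OSI pipeline role
PolicyDocument:
  Version: "2012-10-17"
  Statement:
    # Bucket-level permissions, including the Region lookup
    # that cross-Region ingestion requires
    - Effect: Allow
      Action:
        - "s3:ListBucket"
        - "s3:GetBucketLocation"
      Resource:
        - "arn:aws:s3:::amzn-s3-demo-bucket1"
        - "arn:aws:s3:::amzn-s3-demo-bucket2"
    # Object-level read access for the data to ingest
    - Effect: Allow
      Action:
        - "s3:GetObject"
      Resource:
        - "arn:aws:s3:::amzn-s3-demo-bucket1/*"
        - "arn:aws:s3:::amzn-s3-demo-bucket2/*"
&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 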
&lt;p&gt;After you complete these steps, you can set up your Amazon OpenSearch Ingestion pipelines for either batch or streaming scenarios. In the following sections, I’ll give you recommendations on when to choose each approach and outline the steps for creating your pipeline.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Batch scenarios&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;You can use the OpenSearch Ingestion S3 scan capability to read batch data from S3. You might find this approach useful when your data is written to S3 on a schedule. To perform a cross-Region S3 scan, you only specify the buckets that you’re reading from when you create the OpenSearch Ingestion pipeline.&lt;/p&gt; 
&lt;p&gt;The following diagram shows the design for an OpenSearch Ingestion pipeline in &lt;code&gt;us-west-2&lt;/code&gt; reading from S3 buckets in &lt;code&gt;us-east-1&lt;/code&gt; and &lt;code&gt;eu-west-1&lt;/code&gt; and writing that data into an OpenSearch Service domain in &lt;code&gt;us-west-2&lt;/code&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-90616 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/25/BDB-5804-image-1.jpg" alt="" width="701" height="421"&gt;&lt;/p&gt; 
&lt;p&gt;Next, you will &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/creating-pipeline.html" target="_blank" rel="noopener noreferrer"&gt;create an OpenSearch Ingestion pipeline&lt;/a&gt;. You must create this pipeline in the same Region as your OpenSearch Service domain or collection.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;version: "2"
s3-scan-cross-region:
  source:
    s3:
      compression: automatic
      codec:
        json:
      scan:
        buckets:
          - bucket:
              name: amzn-s3-demo-bucket1
          - bucket:
              name: amzn-s3-demo-bucket2
      aws:
        region: us-west-2

  sink:
    - opensearch:
        hosts: [ "https://search-mydomain-abcdefghijklmn.us-west-2.es.amazonaws.com" ]
        index: s3_scan_cross_region
        aws:
          region: us-west-2
&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The previous pipeline configuration uses the JSON codec. You might want to &lt;a href="https://docs.opensearch.org/latest/data-prepper/pipelines/configuration/sources/s3/#codec" target="_blank" rel="noopener noreferrer"&gt;configure a different codec&lt;/a&gt; if your data isn’t a large JSON object.&lt;/p&gt; 
&lt;p&gt;You can now query your OpenSearch Service domain or collection to see the data that you ingested.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Streaming scenarios: AWS vended logs&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Like many of our customers, you might want to ingest S3 data from different AWS Regions into OpenSearch Service. A common reason is to consolidate AWS vended logs, such as VPC Flow Logs, CloudTrail data, and load balancer logs. For these scenarios, you can configure OpenSearch Ingestion pipelines to read from an Amazon SQS queue to stream data into your OpenSearch Service domain or collection.&lt;/p&gt; 
&lt;p&gt;AWS vended logs are written to Amazon S3 in the same AWS Region as the service that generates them. For example, VPC Flow Logs will be in the same AWS Region as your Amazon VPC. You can use OpenSearch Ingestion to consolidate these logs into one AWS Region. In the VPC Flow Logs example, you can consolidate your VPC Flow Logs from multiple AWS Regions into a single OpenSearch Service domain or collection to analyze network patterns from your different Amazon VPCs.&lt;/p&gt; 
&lt;p&gt;The following diagram outlines the overall setup. It shows an example of sending AWS vended logs from &lt;code&gt;us-east-1&lt;/code&gt; and &lt;code&gt;eu-west-1&lt;/code&gt; to an OpenSearch Service domain in &lt;code&gt;us-west-2&lt;/code&gt;. You can change the AWS Regions depending on your specific needs.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-90617 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/25/BDB-5804-image-2.jpg" alt="" width="1001" height="421"&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;You must configure your vended logs to write log events to Amazon S3 buckets in their respective AWS Regions. Using VPC Flow Logs as our example, you can &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html" target="_blank" rel="noopener noreferrer"&gt;configure VPC Flow Logs for your VPCs&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/creating-sqs-standard-queues.html" target="_blank" rel="noopener noreferrer"&gt;Create an Amazon SQS queue&lt;/a&gt; in the same AWS Region as your OpenSearch Service domain.&lt;/li&gt; 
 &lt;li&gt;Amazon S3 doesn’t send notifications to cross-Region Amazon SQS queues, so you will use intermediate Amazon Simple Notification Service (Amazon SNS) topics to consolidate the notifications from multiple Regions into one queue. For each S3 bucket, &lt;a href="https://docs.aws.amazon.com/sns/latest/dg/sns-create-topic.html" target="_blank" rel="noopener noreferrer"&gt;create an SNS topic&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/EventNotifications.html" target="_blank" rel="noopener noreferrer"&gt;Configure S3 Event Notifications for SNS&lt;/a&gt;. You will do this for each S3 bucket and each SNS topic.&lt;/li&gt; 
 &lt;li&gt;SNS can send cross-Region notifications to SQS. &lt;a href="https://docs.aws.amazon.com/sns/latest/dg/sns-create-subscribe-endpoint-to-topic.html" target="_blank" rel="noopener noreferrer"&gt;Create a subscription&lt;/a&gt; from each SNS topic that you created in step 3 to the single SQS queue you created in step 2.&lt;/li&gt; 
 &lt;li&gt;Configure your pipeline role to read from SQS and read from the relevant S3 buckets.&lt;/li&gt; 
&lt;/ol&gt; 
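&lt;p&gt;For the subscriptions in step 5 to deliver messages, the SQS queue’s access policy generally has to allow each SNS topic to send to it. The following Python sketch builds one such policy. The helper name &lt;code&gt;build_queue_policy&lt;/code&gt; and the topic names are hypothetical; the account ID and queue name follow the example queue URL used in this post.&lt;/p&gt;

```python
import json


def build_queue_policy(queue_arn, topic_arns):
    """Return an SQS access policy letting the given SNS topics send messages."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AllowCrossRegionSnsTopics",
                "Effect": "Allow",
                "Principal": {"Service": "sns.amazonaws.com"},
                "Action": "sqs:SendMessage",
                "Resource": queue_arn,
                # Only the listed topics may deliver to the queue.
                "Condition": {"ArnEquals": {"aws:SourceArn": topic_arns}},
            }
        ],
    }


# Hypothetical topic names, one per source Region.
queue_arn = "arn:aws:sqs:us-west-2:123456789012:amzn-s3-demo-all-regions"
topic_arns = [
    "arn:aws:sns:us-east-1:123456789012:amzn-s3-demo-topic-us-east-1",
    "arn:aws:sns:eu-west-1:123456789012:amzn-s3-demo-topic-eu-west-1",
]
queue_policy = build_queue_policy(queue_arn, topic_arns)
print(json.dumps(queue_policy, indent=2))
```

&lt;p&gt;Restricting the policy with &lt;code&gt;aws:SourceArn&lt;/code&gt; keeps other topics from writing to the queue, which matters because the pipeline treats every message it receives as a pointer to data to ingest.&lt;/p&gt;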
&lt;p&gt;Now &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/creating-pipeline.html" target="_blank" rel="noopener noreferrer"&gt;create an OpenSearch Ingestion pipeline&lt;/a&gt; in the same AWS Region as your OpenSearch Service domain.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;version: "2"
s3-sqs-cross-region:
  source:
    s3:
      notification_type: sqs
      codec:
        newline:
      sqs:
        queue_url: https://sqs.us-west-2.amazonaws.com/123456789012/amzn-s3-demo-all-regions
      aws:
        region: us-west-2

  sink:
    - opensearch:
        hosts: [ "https://search-mydomain-abcdefghijklmn.us-west-2.es.amazonaws.com" ]
        index: s3_sqs_cross_region
        aws:
          region: us-west-2
&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The previous pipeline configuration uses the newline codec, which treats each line of an object as a separate event. You might want to &lt;a href="https://docs.opensearch.org/latest/data-prepper/pipelines/configuration/sources/s3/#codec" target="_blank" rel="noopener noreferrer"&gt;configure a different codec&lt;/a&gt; if your data isn’t line-delimited.&lt;/p&gt; 
&lt;p&gt;Next, &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html" target="_blank" rel="noopener noreferrer"&gt;upload objects&lt;/a&gt; with data into your S3 buckets. When you upload data, S3 sends notifications to SNS, which forwards them to the SQS queue.&lt;/p&gt; 
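&lt;p&gt;Because the notifications pass through SNS, each SQS message body is, by default, an SNS envelope whose &lt;code&gt;Message&lt;/code&gt; field holds the original S3 event as a JSON string (unless raw message delivery is enabled on the subscription). The following Python sketch only illustrates that message shape; it uses a trimmed, hand-written sample, and the function name &lt;code&gt;extract_s3_objects&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
import json
from urllib.parse import unquote_plus


def extract_s3_objects(sqs_body):
    """Yield (region, bucket, key) for each record in an SNS-wrapped S3 event."""
    envelope = json.loads(sqs_body)
    # With raw message delivery off, the S3 event is a JSON string in "Message".
    s3_event = json.loads(envelope["Message"])
    for record in s3_event.get("Records", []):
        yield (
            record["awsRegion"],
            record["s3"]["bucket"]["name"],
            # Object keys are URL-encoded in S3 event notifications.
            unquote_plus(record["s3"]["object"]["key"]),
        )


# Trimmed, hand-written sample; real notifications carry many more fields.
sample_event = {
    "Records": [
        {
            "eventName": "ObjectCreated:Put",
            "awsRegion": "eu-west-1",
            "s3": {
                "bucket": {"name": "amzn-s3-demo-bucket2"},
                "object": {"key": "logs/example+file.log.gz"},
            },
        }
    ]
}
sample_body = json.dumps({"Type": "Notification", "Message": json.dumps(sample_event)})

objects = list(extract_s3_objects(sample_body))
print(objects)
```

&lt;p&gt;Inspecting a message from your queue this way can help you confirm that notifications from every source Region are arriving before you start the pipeline.&lt;/p&gt;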
&lt;p&gt;You can now query your OpenSearch Service domain or collection to see the data that you ingested.&lt;/p&gt; 
&lt;p&gt;Here is what makes this possible and what has changed. The &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/configure-client-s3.html" target="_blank" rel="noopener noreferrer"&gt;SQS queue receives the event notifications&lt;/a&gt; for the buckets. Before the cross-Region feature of OpenSearch Ingestion, the pipeline could see these events but couldn’t access the S3 bucket, even if the permissions were granted. Now, the pipeline determines the AWS Region that the bucket is in and obtains an AWS Security Token Service (AWS STS) token for that Region. Using the STS token from the same Region as the S3 bucket allows the pipeline to access and read the data.&lt;/p&gt; 
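&lt;p&gt;To make that flow more concrete, the following Python sketch mirrors the lookup logic in a simplified form. It is not the OpenSearch Ingestion implementation. It assumes the documented behavior of the S3 &lt;code&gt;GetBucketLocation&lt;/code&gt; API, which returns an empty location constraint for buckets in &lt;code&gt;us-east-1&lt;/code&gt; and the legacy value &lt;code&gt;EU&lt;/code&gt; for some older &lt;code&gt;eu-west-1&lt;/code&gt; buckets, and the regional STS endpoint pattern of the commercial AWS partition. Both function names are hypothetical.&lt;/p&gt;

```python
def region_from_location_constraint(location_constraint):
    """Map a GetBucketLocation LocationConstraint value to a Region name."""
    if not location_constraint:
        # Buckets in us-east-1 report an empty constraint.
        return "us-east-1"
    if location_constraint == "EU":
        # Legacy value that some older eu-west-1 buckets report.
        return "eu-west-1"
    return location_constraint


def regional_sts_endpoint(region):
    """Return the regional STS endpoint (commercial partition assumed)."""
    return f"https://sts.{region}.amazonaws.com"


for constraint in (None, "EU", "eu-west-1"):
    region = region_from_location_constraint(constraint)
    print(constraint, region, regional_sts_endpoint(region))
```

&lt;p&gt;This lookup is the reason the &lt;code&gt;s3:GetBucketLocation&lt;/code&gt; permission is required for cross-Region ingestion: without it, the pipeline can’t resolve which Region to request credentials for.&lt;/p&gt;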
&lt;h2&gt;&lt;strong&gt;Using the AWS Console&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;When you create the pipeline using the OpenSearch Ingestion console, you will have options to select a blueprint for your use case. These blueprints help you create pipelines for various vended log types by selecting only your SQS queue and OpenSearch domain. The blueprint handles the data type mappings for you by including appropriate processors. You can use these blueprints as a starting point and modify your processors for your specific requirements.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Clean up resources&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;When you’re done testing, use the following steps to delete the resources that you created.&lt;/p&gt; 
&lt;p&gt;If you set up a batch pipeline:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/delete-pipeline.html" target="_blank" rel="noopener noreferrer"&gt;Delete&lt;/a&gt; the OpenSearch Ingestion pipeline.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;If you set up a streaming pipeline:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/delete-pipeline.html" target="_blank" rel="noopener noreferrer"&gt;Delete&lt;/a&gt; the OpenSearch Ingestion pipeline.&lt;/li&gt; 
 &lt;li&gt;If you created an SQS queue, &lt;a href="https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/step-delete-queue.html" target="_blank" rel="noopener noreferrer"&gt;delete the SQS queue&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;If you created SNS topics, &lt;a href="https://docs.aws.amazon.com/sns/latest/dg/sns-delete-subscription-topic.html" target="_blank" rel="noopener noreferrer"&gt;delete the SNS topics&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;If you configured AWS vended logs, you can delete those logging configurations. This example used VPC Flow Logs. For instructions, see &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/working-with-flow-logs.html#delete-flow-log" target="_blank" rel="noopener noreferrer"&gt;Delete the Flow Logs&lt;/a&gt;.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;For both pipelines, delete the common resources:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_manage_delete.html" target="_blank" rel="noopener noreferrer"&gt;Delete the IAM roles&lt;/a&gt; that you created for your pipeline.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/DeletingObjects.html" target="_blank" rel="noopener noreferrer"&gt;Delete the S3 objects&lt;/a&gt; that you uploaded and the &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html" target="_blank" rel="noopener noreferrer"&gt;S3 bucket&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;Delete the &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/gsgdeleting.html" target="_blank" rel="noopener noreferrer"&gt;Amazon OpenSearch domain&lt;/a&gt; or the &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-delete.html" target="_blank" rel="noopener noreferrer"&gt;Amazon OpenSearch Serverless collection&lt;/a&gt;.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;In this post, I showed you how you can use Amazon OpenSearch Ingestion to ingest data from Amazon S3 buckets in different AWS Regions. I showed that this works for both batch scan and streaming scenarios. The feature offers you a straightforward way to consolidate your data from other Regions into one OpenSearch Service domain or collection.&lt;/p&gt; 
&lt;p&gt;To get started with the cross-Region S3 source, refer to the &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ingestion.html" target="_blank" rel="noopener noreferrer"&gt;OpenSearch Ingestion documentation&lt;/a&gt; or try creating a pipeline from one of our blueprints using the OpenSearch Ingestion console. You can &lt;a href="https://docs.opensearch.org/latest/data-prepper/common-use-cases/codec-processor-combinations/" target="_blank" rel="noopener noreferrer"&gt;read about the codecs&lt;/a&gt; that OpenSearch Ingestion offers for parsing your S3 objects. You can also learn about the &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-config-reference.html" target="_blank" rel="noopener noreferrer"&gt;various processors&lt;/a&gt; that OpenSearch Ingestion offers, so you can transform and enrich your data to meet your needs.&lt;/p&gt; 
&lt;p&gt;You can also use OpenSearch Ingestion for ingestion that is both cross-Region and cross-account. To do this, you must grant &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-walkthroughs-managing-access-example2.html" target="_blank" rel="noopener noreferrer"&gt;cross-account permissions&lt;/a&gt; on your S3 bucket. You must also make some changes to your &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/configure-client-s3.html#fdsf" target="_blank" rel="noopener noreferrer"&gt;pipeline configuration&lt;/a&gt;. Combining what I showed you in this post with the existing cross-account features greatly expands your ingestion options.&lt;/p&gt; 
&lt;p&gt;If you’re ready to take your streaming ingestion analytics to the next level, you can read about how to &lt;a href="https://docs.opensearch.org/latest/data-prepper/common-use-cases/metrics-logs/" target="_blank" rel="noopener noreferrer"&gt;generate metrics from logs&lt;/a&gt; and even how to send those derived metrics to &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/configure-client-prometheus.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Service for Prometheus&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;Have you tried out the cross-Region capabilities of OpenSearch Ingestion? Share your use cases and questions in the comments.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h3&gt;About the authors&lt;/h3&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone wp-image-90515 size-thumbnail" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/21/BDB-5804-image-3-100x100.jpeg" alt="" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/venabledavid/" target="_blank" rel="noopener noreferrer"&gt;David&lt;/a&gt; is a senior software engineer working on observability in OpenSearch at Amazon Web Services. He is a maintainer on the Data Prepper project.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Enable real-time mainframe analytics with Precisely Connect and Amazon S3</title>
		<link>https://aws.amazon.com/blogs/big-data/enable-real-time-mainframe-analytics-with-precisely-connect-and-amazon-s3/</link>
					
		
		<dc:creator><![CDATA[Supreet Padhi, Rochelle Grubbs]]></dc:creator>
		<pubDate>Fri, 08 May 2026 13:29:29 +0000</pubDate>
				<category><![CDATA[Amazon S3 Tables]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Customer Solutions]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<category><![CDATA[Partner solutions]]></category>
		<guid isPermaLink="false">84b11f9d80f5065fddba0ee94737c75e53799c8a</guid>

					<description>In this post, we discuss how you can use Precisely Connect to enable real-time, direct replication of mainframe data to Amazon Simple Storage Service (Amazon S3), and how your organization can extend this foundation using Amazon S3 Tables for advanced analytics.</description>
										<content:encoded>&lt;p&gt;&lt;em&gt;This is a guest post by Supreet Padhi, Technology Architect, Strategic Technologies, and Rochelle Grubbs, Senior Director, Solution Architect at Precisely &lt;/em&gt;&lt;em&gt;in partnership with AWS.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;Business leaders face a critical challenge to enable real-time analytics. Their most valuable data sits in mainframe systems that reliably process billions of transactions daily, but extracting value for modern analytics and AI remains complex and costly. Traditional mainframe-to-cloud integration approaches require multi-step replication with intermediary systems, creating operational overhead, latency, and data integrity risks. This complexity delays insights, increases infrastructure costs, limits agility, and blocks organizations from using AI and machine learning on their mainframe data.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://www.precisely.com/" target="_blank" rel="noopener noreferrer"&gt;Precisely&lt;/a&gt;, a global leader in data integrity with over 12,000 customers including 95 of the Fortune 100, has &lt;a href="https://www.precisely.com/press-release/precisely-accelerates-mainframe-modernization-with-real-time-data-replication-to-amazon-s3/" target="_blank" rel="noopener noreferrer"&gt;announced&lt;/a&gt; an expansion of its collaboration with AWS through new enhancements to Precisely Connect. Precisely is an &lt;a href="https://partners.amazonaws.com/partners/001E000000fgBJWIA2/Precisely" target="_blank" rel="noopener noreferrer"&gt;AWS Data and Analytics ISV Competency and AWS Migration and Modernization ISV Competency&lt;/a&gt; partner. Precisely has service specializations in &lt;a href="https://aws.amazon.com/redshift/" target="_blank" rel="noopener noreferrer"&gt;Amazon Redshift&lt;/a&gt; and &lt;a href="https://aws.amazon.com/rds/" target="_blank" rel="noopener noreferrer"&gt;Amazon Relational Database Service (Amazon RDS)&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;In &lt;a href="https://aws.amazon.com/blogs/big-data/stream-mainframe-data-to-aws-in-near-real-time-with-precisely-and-amazon-msk/" target="_blank" rel="noopener noreferrer"&gt;Stream mainframe data to AWS in near-real time with Precisely and Amazon MSK&lt;/a&gt;, we showed you how to set up mainframe CDC and the AWS Mainframe Modernization – Data Replication for IBM z/OS Amazon Machine Image (AMI) available in &lt;a href="https://aws.amazon.com/marketplace/pp/prodview-sjnpyprulmohs" target="_blank" rel="noopener noreferrer"&gt;AWS Marketplace&lt;/a&gt;. In this post, we discuss how you can use Precisely Connect to enable real-time, direct replication of mainframe data to &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener noreferrer"&gt;Amazon Simple Storage Service (Amazon S3)&lt;/a&gt;, and how your organization can extend this foundation using &lt;a href="https://aws.amazon.com/s3/features/tables/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3 Tables&lt;/a&gt; for advanced analytics.&lt;/p&gt; 
&lt;h2&gt;Real-time mainframe data access&lt;/h2&gt; 
&lt;p&gt;Organizations that can connect their mainframe environments with modern cloud platforms can gain advantages through improved agility, reduced operational costs, and enhanced analytics capabilities. For example, moving appropriate analytics and reporting workloads to the cloud can significantly reduce mainframe operational costs while maintaining performance and reliability. Real-time data access makes insights available within seconds rather than waiting for batch processing cycles, enabling faster responses to market changes and customer needs. Eliminating bulk data extracts and intermediary systems also reduces infrastructure and maintenance expenses. This frees IT resources to focus on higher-value initiatives.&lt;/p&gt; 
&lt;p&gt;However, implementing mainframe-to-cloud integrations presents unique technical challenges that require specialized solutions. These include converting mainframe character encoding (EBCDIC) to standard ASCII format and handling mainframe-specific data types such as packed decimal (COMP-3) fields. You also need to manage the complexity of VSAM (Virtual Storage Access Method) files that can store multiple record types in a single file, and maintain real-time synchronization without impacting mainframe performance.&lt;/p&gt; 
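&lt;p&gt;To give a sense of what these conversions involve, the following Python sketch decodes an EBCDIC string and a packed-decimal field (COMP-3 in COBOL). It is an illustration only, not how Precisely Connect performs the conversion. The code page &lt;code&gt;cp037&lt;/code&gt; is an assumption (mainframes use several EBCDIC code pages), and the function names and sample bytes are hypothetical.&lt;/p&gt;

```python
from decimal import Decimal


def decode_ebcdic(raw, codec="cp037"):
    """Decode EBCDIC bytes to text; cp037 (US/Canada) is an assumed code page."""
    return raw.decode(codec)


def unpack_comp3(raw, scale=0):
    """Decode a packed-decimal (COMP-3) field into a Decimal.

    Each byte holds two 4-bit digits, except the last byte, whose low
    nibble is the sign: 0xB or 0xD means negative, anything else is
    treated here as non-negative.
    """
    nibbles = []
    for byte in raw:
        nibbles.append(byte // 16)  # high nibble
        nibbles.append(byte % 16)   # low nibble
    sign_nibble = nibbles.pop()
    if any(digit not in range(10) for digit in nibbles):
        raise ValueError("invalid packed-decimal digit")
    value = int("".join(str(digit) for digit in nibbles))
    if sign_nibble in (0x0B, 0x0D):
        value = -value
    # scale is the number of implied decimal places in the field definition.
    return Decimal(value).scaleb(-scale)


print(decode_ebcdic(bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])))  # HELLO
print(unpack_comp3(bytes([0x12, 0x34, 0x5C]), scale=2))      # 123.45
print(unpack_comp3(bytes([0x00, 0x98, 0x7D])))               # -987
```

&lt;p&gt;Note that the decimal point isn’t stored in the packed field itself. It comes from the record layout (for example, a COBOL copybook), which is one reason these conversions need metadata about the source and can’t be done on the bytes alone.&lt;/p&gt;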
&lt;p&gt;Change Data Capture (CDC) technology addresses these challenges through incremental data movement that eliminates disruptive bulk extracts by streaming only changed data to cloud targets, minimizing system impact and ensuring data currency. Real-time synchronization keeps cloud applications in sync with mainframe systems, enabling immediate insights and responsive operations.&lt;/p&gt; 
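&lt;p&gt;The idea of incremental data movement can be sketched in a few lines of Python. The event format below (an &lt;code&gt;op&lt;/code&gt;, a &lt;code&gt;key&lt;/code&gt;, and an optional &lt;code&gt;row&lt;/code&gt;) is hypothetical and is not the format that Precisely Connect emits; the point is only that a target can stay in sync by applying an ordered stream of changes instead of reloading the full data set.&lt;/p&gt;

```python
def apply_changes(target, changes):
    """Apply ordered change events to a dict keyed by primary key."""
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            # Inserts and updates both store the latest row image.
            target[key] = change["row"]
        elif op == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


# Hypothetical replica state and change stream.
replica = {1: {"balance": 100}}
changes = [
    {"op": "update", "key": 1, "row": {"balance": 75}},
    {"op": "insert", "key": 2, "row": {"balance": 10}},
    {"op": "delete", "key": 1},
]
apply_changes(replica, changes)
print(replica)
```

&lt;p&gt;Only three small events cross the network here, regardless of how large the source table is, which is why CDC keeps the impact on the source system low.&lt;/p&gt;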
&lt;h2&gt;Precisely Connect: Real-time data replication to Amazon S3&lt;/h2&gt; 
&lt;p&gt;With Precisely Connect, you can replicate data directly from mainframes to Amazon S3 in real time, eliminating the need for intermediaries and simplifying modernization. Data flows directly from mainframe sources, including Db2 z/OS, IMS, and VSAM, to Amazon S3, eliminating intermediary steps and reducing both latency and operational complexity. You can move mainframe data directly to Amazon S3 data lakes and analytics platforms without managing complex, multi-step replication processes.&lt;/p&gt; 
&lt;p&gt;The simplicity of this approach reduces maintenance overhead and integration complexity by removing the need for staging servers, middleware, or batch processing systems. After data lands in Amazon S3, it becomes immediately available for downstream AWS workloads. You can use &lt;a href="https://aws.amazon.com/athena/" target="_blank" rel="noopener noreferrer"&gt;Amazon Athena&lt;/a&gt; for SQL queries, &lt;a href="https://aws.amazon.com/glue/" target="_blank" rel="noopener noreferrer"&gt;AWS Glue&lt;/a&gt; for ETL and data cataloging, &lt;a href="https://aws.amazon.com/emr/" target="_blank" rel="noopener noreferrer"&gt;Amazon EMR&lt;/a&gt; for big data processing, &lt;a href="https://aws.amazon.com/sagemaker/ai/" target="_blank" rel="noopener noreferrer"&gt;Amazon SageMaker AI&lt;/a&gt; for machine learning, and &lt;a href="https://aws.amazon.com/quick/quicksight/" target="_blank" rel="noopener noreferrer"&gt;Amazon Quick Sight&lt;/a&gt; for business intelligence dashboards.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;Here we present a solution architecture for streaming mainframe data changes from Db2z through &lt;a href="https://aws.amazon.com/marketplace/pp/prodview-sjnpyprulmohs" target="_blank" rel="noopener noreferrer"&gt;AWS Mainframe Modernization – Data Replication for IBM z/OS&lt;/a&gt; AMI directly to Amazon S3 and then using &lt;a href="https://aws.amazon.com/s3/features/tables/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3 Tables&lt;/a&gt; for advanced analytics capabilities.&lt;/p&gt; 
&lt;p&gt;By introducing direct S3 replication and streamlining deployment through the pre-configured AWS Marketplace AMI, you can deploy in minutes rather than weeks. This creates new possibilities for data distribution, transformation, and consumption. This architecture offers several key benefits:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Simplified deployment&lt;/strong&gt; – Accelerate implementation using the preconfigured AWS Marketplace AMI&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Direct replication&lt;/strong&gt; – Eliminate intermediary systems by streaming data directly to Amazon S3, reducing latency and operational overhead&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Real-time synchronization&lt;/strong&gt; – Capture changes as they occur on the mainframe, ensuring downstream applications operate on current data&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Flexible analytics options&lt;/strong&gt; – Use S3 Tables for Iceberg-compatible tabular data storage&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Comprehensive AWS integration&lt;/strong&gt; – Gain immediate access to Amazon EMR, Amazon Athena, AWS Glue, Amazon SageMaker AI, and Amazon Quick Sight&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Natural language data access&lt;/strong&gt; – Through the &lt;a href="https://github.com/awslabs/mcp/tree/main/src/s3-tables-mcp-server" target="_blank" rel="noopener noreferrer"&gt;MCP Server for Amazon S3 Tables&lt;/a&gt;, AI assistants can interact with structured data using conversational interfaces without needing to write SQL queries.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;To complete the solution, you need the following prerequisites:&lt;/p&gt; 
&lt;h3&gt;Precisely components&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/marketplace/pp/prodview-sjnpyprulmohs" target="_blank" rel="noopener noreferrer"&gt;AWS Mainframe Modernization – Data Replication for IBM z/OS&lt;/a&gt; – Deploy this Precisely Connect AMI from AWS Marketplace. This pre-configured image contains the Apply Engine and Controller Daemon components required for replicating mainframe data changes to Amazon S3.&lt;/li&gt; 
 &lt;li&gt;Precisely Connect CDC Capture/Publisher – Deploy the Precisely Connect CDC Capture/Publisher on your mainframe environment. This component captures changes from Db2z logs and streams them to the Apply Engine over TCP/IP.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;For detailed setup and configuration steps for Precisely components, refer to our previous post &lt;a href="https://aws.amazon.com/blogs/big-data/stream-mainframe-data-to-aws-in-near-real-time-with-precisely-and-amazon-msk/" target="_blank" rel="noopener noreferrer"&gt;Stream mainframe data to AWS in near-real time with Precisely and Amazon MSK&lt;/a&gt;.&lt;/p&gt; 
&lt;h3&gt;Connectivity requirements&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Have network connectivity established between your mainframe environment and AWS using your organization’s approved connectivity method (such as &lt;a href="https://aws.amazon.com/directconnect/" target="_blank" rel="noopener noreferrer"&gt;AWS Direct Connect&lt;/a&gt; or VPN).&lt;/li&gt; 
 &lt;li&gt;Verify that firewall rules allow TCP/IP communication between the mainframe Capture/Publisher and the Apply Engine.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h3&gt;AWS analytics components (optional extension)&lt;/h3&gt; 
&lt;p&gt;After mainframe data lands in Amazon S3, your organization can extend its analytics capabilities using AWS services. One approach is to use Amazon EMR streaming jobs to process and write data to &lt;a href="https://aws.amazon.com/s3/features/tables/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3 Tables&lt;/a&gt;. After the data is stored in S3 Tables, the data can be queried directly using &lt;a href="https://aws.amazon.com/athena/" target="_blank" rel="noopener noreferrer"&gt;Amazon Athena&lt;/a&gt; for ad-hoc SQL analysis. This extension is optional and represents one of several ways to consume and analyze mainframe data after it reaches Amazon S3.&lt;/p&gt; 
&lt;p&gt;The following diagram illustrates the solution architecture.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/28/image-BDB-5540-1.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90677" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/28/image-BDB-5540-1.png" alt="image-BDB-5540-1-architecture" width="1598" height="625"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Capture/Publisher&lt;/strong&gt; – Connect CDC Capture/Publisher captures Db2 changes from Db2 logs using IFI 306 Read and communicates captured data changes to a target engine through TCP/IP.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Controller Daemon&lt;/strong&gt; – The Controller Daemon authenticates all connection requests, managing secure communication between the source and target environments.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Apply Engine&lt;/strong&gt; – The Apply Engine receives the changes from the Publisher agent and applies the changed data to the target Amazon S3.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon S3&lt;/strong&gt; – Serves as the scalable data lake foundation where replicated mainframe data lands.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon EMR streaming job&lt;/strong&gt; – As data arrives, an instance of the Amazon EMR streaming job writes the data to target tables in Amazon S3 Tables.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon Athena&lt;/strong&gt; – Queries data stored in Amazon S3 Tables using standard SQL.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;This architecture provides a clean separation between the data capture process and the data consumption process, allowing each to scale independently. When CDC data arrives in Amazon S3, you can use Amazon S3 Tables to store Db2 z/OS, VSAM, and IMS data in an open table format (Apache Iceberg) that is ready for analytics, providing a flexible path to mainframe modernization.&lt;/p&gt; 
&lt;h2&gt;Quantifiable business value&lt;/h2&gt; 
&lt;p&gt;Organizations implementing this solution typically see significant reductions in mainframe operational costs by offloading analytics and reporting workloads to the cloud. The elimination of intermediary infrastructure reduces both capital and operational expenses. The reduced maintenance burden frees IT resources to focus on strategic initiatives rather than managing complex replication systems. Speed and agility improvements are equally significant. Near real-time data availability, measured in seconds to minutes rather than hours to days, enables organizations to respond rapidly to market changes and operational events. The rapid deployment of new analytics use cases without requiring mainframe changes accelerates innovation. Organizations gain access to the full breadth of AWS services that can be used immediately after data lands in Amazon S3.&lt;/p&gt; 
&lt;p&gt;From an analytics and AI perspective, the solution creates a unified data platform that brings together mainframe, cloud-native, and third-party data sources. This unified view enables advanced machine learning on historical and current data, delivering predictive insights that drive proactive decision-making across the organization.&lt;/p&gt; 
&lt;h2&gt;Customer story&lt;/h2&gt; 
&lt;p&gt;A leading global payments provider put this into practice. The provider was struggling to generate timely analytics and insights from point of sale (POS) transaction data. As one of the world’s largest payment providers, it processes hundreds of thousands of transactions per second. Users expect to swipe their card and have their transaction approved in seconds. A new architecture was needed to keep up with customer demand and transaction volume. By streaming mission-critical mainframe data directly to AWS in real time using Precisely Connect and landing it in Amazon S3 Tables, the company adopted storage built on the Apache Iceberg open standard. This approach enables high-performance analytics directly on mainframe data alongside cloud-native sources.&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, we demonstrated how Precisely Connect enables real-time, direct data replication from mainframes to Amazon S3, eliminating intermediaries and simplifying mainframe modernization.&lt;/p&gt; 
&lt;p&gt;Your organization can further extend this foundation with &lt;a href="https://aws.amazon.com/s3/features/tables/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3 Tables&lt;/a&gt;, purpose-built storage for Apache Iceberg tables in S3, enabling analytical applications to query the most current mainframe data using tools such as Amazon Athena, Amazon EMR, and Amazon Redshift.&lt;/p&gt; 
&lt;p&gt;Get started by deploying &lt;a href="https://aws.amazon.com/marketplace/pp/prodview-sjnpyprulmohs" target="_blank" rel="noopener noreferrer"&gt;AWS Mainframe Modernization – Data Replication for IBM z/OS&lt;/a&gt; from AWS Marketplace and use Amazon S3 as a target for your mainframe use cases. Learn more about Precisely’s mainframe data integration capabilities at &lt;a href="http://www.precisely.com" target="_blank" rel="noopener noreferrer"&gt;precisely.com&lt;/a&gt;. Contact &lt;a href="https://aws.amazon.com/contact-us/" target="_blank" rel="noopener noreferrer"&gt;AWS&lt;/a&gt; and Precisely experts to discuss your specific modernization challenges and design a proof-of-concept that demonstrates business value quickly.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/28/image-BDB-5540-2.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90680" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/28/image-BDB-5540-2.png" alt="image-BDB-5540-2" width="100" height="107"&gt;&lt;/a&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Supreet Padhi&lt;/h3&gt; 
  &lt;p&gt;Supreet is a Technology Architect at Precisely. He has been with Precisely for more than 14 years and specializes in streaming data use cases and technology, with an emphasis on data warehouse architecture. He is responsible for research and development in areas such as Change Data Capture (CDC), streaming ETL, metadata management, and VectorDBs.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/28/image-BDB-5540-3-1.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90680" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/28/image-BDB-5540-3-1.png" alt="image-BDB-5540-3" width="100" height="107"&gt;&lt;/a&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Rochelle Grubbs&lt;/h3&gt; 
  &lt;p&gt;Rochelle is a Senior Director and Solution Architect for Precisely’s Data Integration solutions and has been with Precisely for over 11 years. She has spent the last several years focusing on databases, analytics, data trends, data integration, and GenAI. Rochelle is an expert on Precisely’s OEM AWS Mainframe Migration offering and is driven to help customers successfully migrate their applications and workloads to the cloud.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/28/image-BDB-5540-4.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90680" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/28/image-BDB-5540-4.png" alt="image-BDB-5540-4" width="100" height="107"&gt;&lt;/a&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Tamara Astakhova&lt;/h3&gt; 
  &lt;p&gt;Tamara is a Sr. Partner Solutions Architect in Data and Analytics at AWS with over two decades of expertise in architecting and developing large-scale data analytics systems. In her current role, she collaborates with strategic partners to design and implement sophisticated AWS-optimized architectures. Her deep technical knowledge and experience make her an invaluable resource in helping organizations transform their data infrastructure and analytics capabilities.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Build streaming applications on Amazon Managed Service for Apache Flink with AI-assisted guidance</title>
		<link>https://aws.amazon.com/blogs/big-data/build-streaming-applications-on-amazon-managed-service-for-apache-flink-with-ai-assisted-guidance/</link>
					
		
		<dc:creator><![CDATA[Mazrim Mehrtens]]></dc:creator>
		<pubDate>Wed, 06 May 2026 15:45:57 +0000</pubDate>
				<category><![CDATA[Amazon Managed Service for Apache Flink]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Announcements]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">e939b9e157bb989186598da2f70b5d75d0ac8981</guid>

					<description>In this post, we walk through installing the Power and Skill, using Amazon Kinesis Data Streams to build a Kinesis Data Stream-to-Kinesis Data Stream streaming pipeline, and migrating an existing application to Flink 2.2. You can follow along with this use case to see how the Managed Service for Apache Flink Kiro Power can help you build a resilient, performant application grounded in best practices.</description>
										<content:encoded>&lt;p&gt;Building production-ready &lt;a href="https://flink.apache.org/" target="_blank" rel="noopener noreferrer"&gt;Apache Flink&lt;/a&gt; applications requires learning a complex ecosystem. The learning curve is steep for newcomers, and even experienced Flink developers encounter complexity when scaling applications or troubleshooting production issues. With the new &lt;a href="https://kiro.dev" target="_blank" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt; &lt;a href="https://kiro.dev/powers/" target="_blank" rel="noopener noreferrer"&gt;Power&lt;/a&gt; and &lt;a href="https://agentskills.io/home" target="_blank" rel="noopener noreferrer"&gt;Agent Skill&lt;/a&gt; for &lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/what-is.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Service for Apache Flink&lt;/a&gt;, you can get AI-assisted guidance for building, improving, and migrating streaming applications directly in your development environment, with recommendations that are grounded in best practices.&lt;/p&gt; 
&lt;p&gt;The Managed Service for Apache Flink Kiro Power and Agent Skill helps you navigate challenges across the Flink application lifecycle. For new development, the tool provides contextual guidance on application architecture, state management patterns, and connector selection. For existing application improvements, it analyzes your existing code to identify performance bottlenecks, reliability risks, and opportunities for improvement. If you’re &lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/flink-2-2-upgrade-guide.html" target="_blank" rel="noopener noreferrer"&gt;upgrading from Apache Flink 1.x to 2.x&lt;/a&gt;, it detects compatibility issues and provides targeted refactoring steps to modernize your applications.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="wp-image-90694 size-full aligncenter" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-5913-1-resize.png" alt="" width="600" height="1084"&gt;&lt;/p&gt; 
&lt;p&gt;In this post, we walk through installing the Power and Skill, using &lt;a href="https://aws.amazon.com/kinesis/" target="_blank" rel="noopener noreferrer"&gt;Amazon Kinesis Data Streams&lt;/a&gt; to build a streaming pipeline from one Kinesis data stream to another, and migrating an existing application to Flink 2.2. You can follow along with this use case to see how the Managed Service for Apache Flink Kiro Power can help you build a resilient, performant application grounded in best practices.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;The &lt;a href="https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files" target="_blank" rel="noopener noreferrer"&gt;Managed Service for Apache Flink Power/Skill&lt;/a&gt; works across multiple AI development tools, providing the same comprehensive guidance in each:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Kiro&lt;/strong&gt;: Installs as a Power that automatically activates for Flink-related development activities&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://cursor.com/en-US/docs" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;Cursor&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt; and &lt;/strong&gt;&lt;a href="https://code.claude.com/docs/en/overview" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;Claude Code&lt;/strong&gt;&lt;/a&gt;: Installs as an Agent Skill following the open Agent Skills standard&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Other compatible agents&lt;/strong&gt;: Compatible with tools supporting the Agent Skills specification&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The Power/Skill provides guidance across the development lifecycle:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Best practices for Managed Service for Apache Flink application development&lt;/li&gt; 
 &lt;li&gt;Maven dependency management and project structure&lt;/li&gt; 
 &lt;li&gt;Resource improvements including KPU sizing, parallelism tuning, and checkpointing&lt;/li&gt; 
 &lt;li&gt;Job graph architecture patterns and anti-patterns&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/cloudwatch/" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch&lt;/a&gt; monitoring and logging configuration&lt;/li&gt; 
 &lt;li&gt;Flink 1.x to 2.2 migration guidance with state compatibility assessment&lt;/li&gt; 
 &lt;li&gt;Connector-specific guidelines&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The content is maintained in a single repository with use case-specific entry points that are dynamically loaded depending on your needs.&lt;/p&gt; 
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;To use the tool, you need:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;A development machine running macOS, Linux, or Windows with Java 11 or later (Java 17 for Flink 2.2) and Apache Maven installed&lt;/li&gt; 
 &lt;li&gt;One of the following AI development tools: 
  &lt;ul&gt; 
   &lt;li&gt;Kiro IDE&lt;/li&gt; 
   &lt;li&gt;Cursor&lt;/li&gt; 
   &lt;li&gt;Claude Code&lt;/li&gt; 
   &lt;li&gt;Other Agent Skills-compatible tools&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;Basic knowledge of Java and stream processing concepts (helpful but not required)&lt;/li&gt; 
 &lt;li&gt;An &lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener noreferrer"&gt;AWS Identity and Access Management&lt;/a&gt; (IAM) role configured with access to create and run Managed Service for Apache Flink applications, create Amazon Simple Storage Service (Amazon S3) buckets for Flink application dependencies, create Kinesis Data Streams for streaming, and create IAM roles (required if deploying an application)&lt;/li&gt; 
&lt;/ul&gt; 
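&lt;p&gt;You can check the local tool prerequisites from a terminal before you start. The following is a minimal sketch and isn’t part of the Power/Skill: the &lt;code&gt;require_tool&lt;/code&gt; helper name is our own, and it only confirms that &lt;code&gt;java&lt;/code&gt; and &lt;code&gt;mvn&lt;/code&gt; are on your PATH. You still need to read the printed versions yourself.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Hypothetical helper: report whether a command is available on the PATH.
require_tool() {
  if [ -n "$(command -v "$1")" ]; then
    echo "found: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Print versions for a manual check (Java 11 or later; Java 17 for Flink 2.2).
for tool in java mvn; do
  if require_tool "$tool"; then
    "$tool" -version
  fi
done&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 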
&lt;h2&gt;Installation&lt;/h2&gt; 
&lt;h3&gt;Installing as a Kiro Power&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Open Kiro IDE.&lt;/li&gt; 
 &lt;li&gt;Open &lt;a href="https://kiro.dev/launch/powers/amazon-managed-service-for-apache-flink-power/"&gt;Amazon Managed Service for Apache Flink&lt;/a&gt; and select&amp;nbsp;&lt;strong&gt;Open in Kiro.&lt;/strong&gt;&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90620" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/BDB-5913-open-in-kiro.png" alt="" width="2896" height="1560"&gt;&lt;/p&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;Choose&amp;nbsp;&lt;strong&gt;Install&lt;/strong&gt; to install the power.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90621" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/BDB-5913-install-power.png" alt="" width="2238" height="1064"&gt;&lt;/p&gt; 
&lt;ol start="4"&gt; 
 &lt;li&gt;Verify that the power is listed in the installed powers in the Kiro IDE.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90634" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/BDB-5913-installed-power-1.png" alt="" width="2736" height="1158"&gt;&lt;/p&gt; 
&lt;p&gt;The Power is now installed and automatically activates when you work on Flink-related development activities.&lt;/p&gt; 
&lt;h3&gt;Installing as an Agent Skill&lt;/h3&gt; 
&lt;p&gt;Agent Skills are discovered automatically by compatible tools through the SKILL.md file. Installation varies by tool:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Per-project installation&lt;/strong&gt; (available in one project):&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# For Cursor
git clone https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files.git .cursor/skills/flink

# For Claude Code
git clone https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files.git .claude/skills/flink

# For other Agent Skills-compatible tools
git clone https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files.git .agents/skills/flink&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Personal installation&lt;/strong&gt; (available across projects):&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# For Cursor
git clone https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files.git ~/.cursor/skills/flink

# For Claude Code
git clone https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files.git ~/.claude/skills/flink&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;To verify the installation, interact with the skill in your preferred tool. In Claude Code, you can invoke it with /flink. In Cursor, type / in Agent chat and search for flink. For more information about Agent Skills, see the &lt;a href="https://agentskills.io/home" target="_blank" rel="noopener noreferrer"&gt;Agent Skills documentation&lt;/a&gt;.&lt;/p&gt; 
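&lt;p&gt;Because compatible tools discover the skill through its SKILL.md file, a quick file check can also confirm that the clone landed where you expected. This sketch assumes that SKILL.md sits at the top level of the cloned directory, which you should verify against the repository. The &lt;code&gt;check_skill&lt;/code&gt; function name is our own.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Hypothetical helper: confirm that a skill directory contains SKILL.md.
# Assumes SKILL.md is at the top level of the clone.
check_skill() {
  if [ -f "$1/SKILL.md" ]; then
    echo "skill found in $1"
  else
    echo "no SKILL.md in $1"
    return 1
  fi
}

# Check the per-project and personal locations used in the clone commands.
# A "no SKILL.md" result only means the skill is not installed at that path.
for dir in .cursor/skills/flink .claude/skills/flink \
  "$HOME/.cursor/skills/flink" "$HOME/.claude/skills/flink"; do
  check_skill "$dir" || true
done&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 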
&lt;h2&gt;Example: Building a Kinesis-to-Kinesis streaming pipeline&lt;/h2&gt; 
&lt;p&gt;Rather than listing best practices, the Power/Skill actively guides you through making the right architectural decisions at each stage of development.&lt;/p&gt; 
&lt;p&gt;The following walkthrough demonstrates building a Flink application that reads from &lt;a href="https://aws.amazon.com/kinesis/" target="_blank" rel="noopener noreferrer"&gt;Amazon Kinesis Data Streams&lt;/a&gt;, analyzes events, and writes to another Kinesis stream. To follow along, run the same prompts in your Kiro IDE or other development tool. In the following prompts, we focus on local development and don’t create AWS resources. However, if you prompt the agent to create and deploy AWS resources, they will incur additional costs.&lt;/p&gt; 
&lt;h3&gt;Starting the conversation&lt;/h3&gt; 
&lt;p&gt;In the Kiro IDE, we can open a new chat in Vibe mode and prompt: &lt;em&gt;“Help me build a Flink application that reads from Kinesis, processes events with windowed aggregations, and writes results to another Kinesis stream”:&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-90465 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5913-5.png" alt="Kiro chat showing a prompt to build a Kinesis streaming application" width="2066" height="1626"&gt;&lt;/p&gt; 
&lt;h3&gt;What happens next&lt;/h3&gt; 
&lt;p&gt;The AI assistant loads relevant guidance and walks you through the development process:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;1. Confirm project requirements and details&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Kiro automatically loads the Power based on the context of your prompt. The assistant then asks you questions about your use case to make sure that it builds the right application for your needs:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90637" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/BDB-5913-6-1.png" alt="" width="1304" height="2212"&gt;&lt;/p&gt; 
&lt;p&gt;For the demo, we can prompt for a financial services use case: &lt;em&gt;“I’m in financial services, so let’s use that as the use case. Try calculating volatility in real-time. And let’s use Flink 1.20 for now.”&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;Kiro then confirms its assumptions and asks to proceed:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90467" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5913-7.png" alt="" width="1982" height="934"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;2. Project setup&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;After we confirm, Kiro generates a project with Flink 1.20 dependencies, Kinesis connectors, and proper scope configuration for Managed Service for Apache Flink deployment. The assistant creates the application structure with proper configuration separation between local development and Managed Service for Apache Flink service-level settings. Then, it creates a Kinesis source with proper deserialization, a sink with a partitioning strategy, and windowed aggregation logic with proper &lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/troubleshooting-checkpoints.html" target="_blank" rel="noopener noreferrer"&gt;state management&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/troubleshooting-rt-stateleaks.html" target="_blank" rel="noopener noreferrer"&gt;TTL configuration&lt;/a&gt;, and &lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/troubleshooting-source-throttling.html" target="_blank" rel="noopener noreferrer"&gt;error handling&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-90468 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5913-8.png" alt="Generated project structure with Flink dependencies and Kinesis connectors" width="1974" height="1788"&gt;&lt;/p&gt; 
&lt;p&gt;Kiro also compiles the code to verify that it builds correctly. We can then proceed by asking Kiro to help us with running the application locally for testing.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;3. Testing the project locally&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;You can run the application locally to test the results. We can prompt: &lt;em&gt;“Can we run this locally using something like LocalStack to test deploying the job and also see some example results?”&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;Kiro creates the necessary Docker resources, testing scripts, and deployment steps to run the application locally with synthetic resources. If it encounters bugs or detects issues during the local testing process, it fixes them so that your deployment runs smoothly:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-90469 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5913-9.png" alt="Kiro creating Docker resources and local testing infrastructure" width="1464" height="1656"&gt;&lt;/p&gt; 
&lt;p&gt;We can also access our local Flink UI to view our application:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-90470 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5913-10.png" alt="Local Flink UI showing the running streaming application" width="3204" height="1898"&gt;&lt;/p&gt; 
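&lt;p&gt;To give a sense of what local testing against LocalStack involves, the following sketch creates Kinesis streams on a local endpoint. It isn’t the output that Kiro generated: the &lt;code&gt;create_local_streams&lt;/code&gt; function, the stream names in the usage note, and the use of LocalStack’s default edge port 4566 are assumptions for illustration. It expects the AWS CLI to be installed and a LocalStack container to be running already.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Assumed local endpoint: the default LocalStack edge port.
ENDPOINT="${ENDPOINT:-http://localhost:4566}"

# Hypothetical helper: create one single-shard stream per argument
# on the local endpoint instead of in your AWS account.
create_local_streams() {
  for stream in "$@"; do
    aws --endpoint-url "$ENDPOINT" kinesis create-stream \
      --stream-name "$stream" --shard-count 1 || return 1
  done
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;For example, after starting LocalStack with &lt;code&gt;docker run --rm -d -p 4566:4566 localstack/localstack&lt;/code&gt;, you could run &lt;code&gt;create_local_streams price-ticks volatility-results&lt;/code&gt; with placeholder credentials and an AWS Region set in your environment. Both stream names are hypothetical.&lt;/p&gt; 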
&lt;p&gt;&lt;strong&gt;4. Deploying the application to Managed Service for Apache Flink&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Now that our application is running and generating results end-to-end, we can use the Power for other tasks. For example, you can get guidance on KPU allocation and parallelism settings based on your expected throughput, configure monitoring with CloudWatch metrics, logging, and dashboards for operational visibility, or set up infrastructure as code (IaC) for deploying in Managed Service for Apache Flink. We can prompt: &lt;em&gt;“This is great! Can you help me deploy this application to Managed Service for Apache Flink? I’d like to use CloudFormation for deployment.”&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-90471 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5913-11.png" alt="Kiro conversation summarizing creation of CloudFormation deployment resources" width="1770" height="1716"&gt;&lt;/p&gt; 
&lt;p&gt;Using the generated &lt;a href="https://aws.amazon.com/cloudformation/" target="_blank" rel="noopener noreferrer"&gt;AWS CloudFormation&lt;/a&gt; templates and deployment scripts, we can deploy our application to AWS with associated resources for Kinesis Data Streams, &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3&lt;/a&gt; buckets for application JAR files, CloudWatch log groups, and IAM roles. Deploying these resources requires IAM credentials with associated permissions and will incur cost for the associated resource usage.&lt;/p&gt; 
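&lt;p&gt;For orientation, a deployment of this kind usually comes down to three commands: build the JAR, upload it to Amazon S3, and deploy the stack. The following sketch is not the script that Kiro generated. The &lt;code&gt;deploy_flink_app&lt;/code&gt; function and its argument order are our own, and &lt;code&gt;CAPABILITY_NAMED_IAM&lt;/code&gt; is included on the assumption that the template creates named IAM roles. A real template will likely also need parameter overrides.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Hypothetical helper: build, upload, and deploy.
# Usage: deploy_flink_app STACK_NAME BUCKET JAR_PATH TEMPLATE_FILE
deploy_flink_app() {
  stack="$1"; bucket="$2"; jar="$3"; template="$4"

  # Build the application JAR with Maven.
  mvn clean package || return 1

  # Upload the JAR so that Managed Service for Apache Flink can load it.
  aws s3 cp "$jar" "s3://$bucket/$(basename "$jar")" || return 1

  # Create or update the stack; the capability flag acknowledges IAM changes.
  aws cloudformation deploy \
    --template-file "$template" \
    --stack-name "$stack" \
    --capabilities CAPABILITY_NAMED_IAM
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 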
&lt;p&gt;In a traditional workflow, you build your application, deploy to Managed Service for Apache Flink, then discover performance issues or configuration problems in production. You spend time debugging checkpoint failures, serialization errors, or resource bottlenecks. With the Power/Skill, the AI assistant catches these issues during development. When you need complex aggregation and processing logic, it helps you to do so in a way that uses resources efficiently with Flink’s scaling model. When you create an application bug that would cause a crash in production, it helps you identify it early with local end-to-end testing. The Power is configured with guidance and best practices to help with the development process from start to finish.&lt;/p&gt; 
&lt;h2&gt;Example: Migrating to Flink 2.2&lt;/h2&gt; 
&lt;p&gt;The Managed Service for Apache Flink Kiro Power and Agent Skill provide contextual advice specific to your situation. For new developers, it walks through the complete workflow from project setup to deployment, explaining Managed Service for Apache Flink-specific concepts along the way. For migration projects, it analyzes your existing code for Flink 2.2 compatibility issues and provides targeted refactoring guidance. The following example shows how the tool helps with the complex task of migrating from Flink 1.x to 2.2.&lt;/p&gt; 
&lt;h3&gt;1. Assessing migration compatibility&lt;/h3&gt; 
&lt;p&gt;We can ask Kiro to help us upgrade our project from the previous example to Flink 2.2: &lt;em&gt;“I need to migrate my Flink 1.x application to 2.2. Can you help me identify compatibility issues?”&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;The assistant loads the Managed Service for Apache Flink Kiro Power and analyzes our code to identify potential issues:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-90472 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5913-12.png" alt="Kiro analyzing Flink 1.x code for 2.2 compatibility issues" width="2322" height="1738"&gt;&lt;/p&gt; 
&lt;p&gt;In this case, using our generated project on Flink 1.20, Kiro identified the following compatibility issues for the upgrade:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Java 11 must move to Java 17 (minimum for Flink 2.2)&lt;/li&gt; 
 &lt;li&gt;Flink version 1.20.3 must update to 2.2.0&lt;/li&gt; 
 &lt;li&gt;The Kinesis connector must update from 5.1.0-1.20 to 6.0.0-2.0&lt;/li&gt; 
 &lt;li&gt;Time references must change to java.time.Duration in window and lateness calls&lt;/li&gt; 
 &lt;li&gt;The LocalStreamEnvironment instanceof check must be removed (class removed in 2.2)&lt;/li&gt; 
 &lt;li&gt;The isEndOfStream() override must be dropped from PriceTickDeserializer (method removed)&lt;/li&gt; 
 &lt;li&gt;implements Serializable must be added to PriceTick and VolatilityResult&lt;/li&gt; 
&lt;/ul&gt; 
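&lt;p&gt;Several of these findings correspond to text patterns that you can also locate yourself. As a rough cross-check, and not a replacement for the assistant’s analysis, the following sketch searches a source tree for three of the items in the list. The &lt;code&gt;find_flink1_apis&lt;/code&gt; function name and the regular expression are our own assumptions, and a text search can miss usages or report false positives.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Hypothetical helper: list lines that mention APIs flagged for the upgrade.
# Matches LocalStreamEnvironment, isEndOfStream, and Time.seconds(...) style calls.
# Usage: find_flink1_apis SOURCE_DIR
find_flink1_apis() {
  grep -rnE \
    'LocalStreamEnvironment|isEndOfStream|Time\.(milliseconds|seconds|minutes|hours|days)\(' \
    "$1"
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;For a standard Maven layout, you would run &lt;code&gt;find_flink1_apis src/main/java&lt;/code&gt;. The function returns a nonzero exit status when nothing matches.&lt;/p&gt; 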
&lt;p&gt;Kiro also verified that some parts of the project are already Flink 2.2 compatible. The project uses the new Source and Sink V2 APIs, the logging is 2.2 ready, the POJOs with no collection fields are state migration safe, and there are no Kryo registrations or TimeCharacteristic usage.&lt;/p&gt; 
&lt;h3&gt;2. Implementing the migration&lt;/h3&gt; 
&lt;p&gt;We can then ask Kiro to provide a step-by-step migration plan, both for updating the code and deploying to Managed Service for Apache Flink: &lt;em&gt;“Can you help me update the application for Flink 2.2, and help me figure out the steps to upgrade my running Managed Service for Apache Flink application?”&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;Kiro evaluates the entire application code base against the Power’s migration guidance and best practices, and provides a comprehensive analysis of the breaking changes, risks, and potential issues that would arise in the upgrade. After we approve the changes, Kiro makes the necessary updates to make our application compatible with Flink 2.2 and provides a step-by-step upgrade process for the running application:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-90473 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5913-13.png" alt="Kiro providing a step-by-step migration plan for Flink 2.2" width="2488" height="1592"&gt;&lt;/p&gt; 
&lt;p&gt;Now that Kiro has prepared the application for Flink 2.2, highlighted migration risks, and provided us with a clear path to execute the upgrade, we can test the upgrade process with confidence. From here, we can proceed to run our Flink 2.2 application locally, test the upgrade process in a development environment in Managed Service for Apache Flink, and then execute the upgrade in our production environment. If we run into issues, we can return to the Kiro Power to get advice, resolve issues, and unblock our upgrade.&lt;/p&gt; 
&lt;h2&gt;Cleanup&lt;/h2&gt; 
&lt;p&gt;To remove the Power/Skill installation:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;For Kiro:&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Open Kiro IDE.&lt;/li&gt; 
 &lt;li&gt;Navigate to the &lt;strong&gt;Powers&lt;/strong&gt; tab.&lt;/li&gt; 
 &lt;li&gt;Uninstall the &lt;strong&gt;Amazon Managed Service for Apache Flink&lt;/strong&gt; Power.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;strong&gt;For Agent Skills:&lt;/strong&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Remove per-project installation
rm -rf .cursor/skills/flink  # or .claude/skills/flink

# Remove personal installation
rm -rf ~/.cursor/skills/flink  # or ~/.claude/skills/flink&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;If you created Managed Service for Apache Flink applications or associated resources during development, clean the resources up:&lt;/p&gt; 
&lt;ol start="4"&gt; 
 &lt;li&gt;Delete the Managed Service for Apache Flink application from the AWS Console.&lt;/li&gt; 
 &lt;li&gt;Remove associated resources for sources and sinks, if created for development.&lt;/li&gt; 
 &lt;li&gt;Delete CloudWatch log groups if no longer needed.&lt;/li&gt; 
&lt;/ol&gt; 
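&lt;p&gt;If you deployed with CloudFormation as in the walkthrough, deleting the stack can remove most of these resources in one step. The following is a minimal sketch, assuming that the stack owns the JAR bucket and that you supply the stack and bucket names yourself. The &lt;code&gt;cleanup_flink_stack&lt;/code&gt; function name is our own. The bucket is emptied first because CloudFormation can’t delete a bucket that still contains objects.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Hypothetical helper: empty the JAR bucket, then delete the stack.
# Usage: cleanup_flink_stack STACK_NAME BUCKET
cleanup_flink_stack() {
  stack="$1"; bucket="$2"

  # Remove uploaded JAR files so the bucket can be deleted with the stack.
  aws s3 rm "s3://$bucket" --recursive || return 1

  # Delete the stack, then block until the deletion completes.
  aws cloudformation delete-stack --stack-name "$stack" || return 1
  aws cloudformation wait stack-delete-complete --stack-name "$stack"
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 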
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, we showed you how the Kiro Power and Agent Skill for Amazon Managed Service for Apache Flink brings AI-assisted development to stream processing. You can use the tool to overcome Flink’s learning curve, build applications following Managed Service for Apache Flink best practices, and migrate to Flink 2.2 with confidence. To get started, choose the path that fits your workflow:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;If you use Kiro, install the Power from the Powers tab and start a new chat with a Flink-related prompt.&lt;/li&gt; 
 &lt;li&gt;If you use Cursor, Claude Code, or another Agent Skills-compatible tool, clone the &lt;a href="https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files" target="_blank" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; into your skills directory and reference the steering/ files for guidance.&lt;/li&gt; 
 &lt;li&gt;If you are new to Amazon Managed Service for Apache Flink, review the &lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/what-is.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Service for Apache Flink Developer Guide&lt;/a&gt; and the &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/" target="_blank" rel="noopener noreferrer"&gt;Apache Flink documentation&lt;/a&gt; to build foundational knowledge alongside the Power/Skill.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;We welcome your feedback. Report issues or request features through &lt;a href="https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files/issues" target="_blank" rel="noopener noreferrer"&gt;GitHub Issues&lt;/a&gt;, or contribute improvements via pull requests.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-89475" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/03/24/bdb-5775-mmehrten-headshot.png" alt="" width="100" height="107"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Mazrim Mehrtens&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/mmehrtens/" target="_blank" rel="noopener"&gt;Mazrim&lt;/a&gt; is a Sr. Specialist Solutions Architect for messaging and streaming workloads. Mazrim works with customers to build and support systems that process and analyze terabytes of streaming data in real time, run enterprise Machine Learning pipelines, and create systems to share data across teams seamlessly with varying data toolsets and software stacks.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Migrating TLS Clients managed by third-party Certificate Authorities from self-managed Apache Kafka to Amazon MSK</title>
		<link>https://aws.amazon.com/blogs/big-data/migrating-tls-clients-managed-by-third-party-certificate-authorities-from-self-managed-apache-kafka-to-amazon-msk/</link>
					
		
		<dc:creator><![CDATA[Ali Alemi]]></dc:creator>
		<pubDate>Wed, 06 May 2026 15:41:21 +0000</pubDate>
				<category><![CDATA[Amazon Managed Streaming for Apache Kafka (Amazon MSK)]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<guid isPermaLink="false">125f60f817573749f223950d8146ad840813a80a</guid>

					<description>In this post, we provide an approach to reuse your existing client certificates without reissuing them through AWS Certificate Manager (ACM) Private Certificate Authority. This solution enables an accelerated migration path by using your current third-party CA infrastructure. This removes the complexity and operational overhead of certificate re-issuance while maintaining the security posture that you've established with your existing mTLS implementation.</description>
										<content:encoded>&lt;p&gt;&lt;a href="https://aws.amazon.com/msk/" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Streaming for Apache Kafka (Amazon MSK)&lt;/a&gt; is a fully managed streaming data service that handles &lt;a href="https://kafka.apache.org/" target="_blank" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt; infrastructure and operations, so developers and DevOps managers can run Apache Kafka applications on AWS. Migrating to Amazon MSK requires no application code changes because Amazon MSK uses fully open source Apache Kafka, allowing existing applications and tools to work seamlessly. &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/msk-broker-types-express.html" target="_blank" rel="noopener noreferrer"&gt;Amazon MSK with Express brokers&lt;/a&gt; streamlines Kafka management by providing up to 3x more throughput, 20x faster scaling, and 180x faster recovery with virtually unlimited storage, delivering resiliency and elasticity for mission-critical workloads.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/kafka_apis_iam.html" target="_blank" rel="noopener noreferrer"&gt;Amazon MSK supports multiple authentication methods&lt;/a&gt; to secure client connections to Kafka clusters. These methods include:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/iam-access-control.html" target="_blank" rel="noopener noreferrer"&gt;IAM authentication&lt;/a&gt; for identity-based access control using AWS Identity and Access Management (IAM) policies.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/msk-authentication.html" target="_blank" rel="noopener noreferrer"&gt;Mutual TLS (mTLS) authentication&lt;/a&gt; where both clients and brokers authenticate each other using X.509 certificates.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/msk-password.html" target="_blank" rel="noopener noreferrer"&gt;SASL/SCRAM authentication&lt;/a&gt; for username and password-based authentication stored in AWS Secrets Manager.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;When customers manage their own Kafka clusters and adopt mTLS, they typically rely on a third-party managed certificate authority (CA) to sign and verify both client and server certificates. This establishes a trust relationship where the CA acts as the trusted intermediary that validates the identity of both parties in the communication. When customers migrate their workloads to Amazon MSK, they must make sure that client certificates are signed by a CA that’s recognized and trusted by the MSK cluster. Amazon MSK recommends that customers use &lt;a href="https://docs.aws.amazon.com/privateca/latest/userguide/PcaWelcome.html" target="_blank" rel="noopener noreferrer"&gt;AWS Private Certificate Authority&lt;/a&gt; to create a private CA within AWS that Amazon MSK trusts. The migration path typically requires customers to either:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Generate new client certificates signed by an AWS Private CA that Amazon MSK recognizes, or&lt;/li&gt; 
 &lt;li&gt;Establish a certificate chain where their existing third-party CA is subordinate to or trusted by an AWS-managed CA&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;In this post, we provide an approach to reuse your existing client certificates without reissuing them through AWS Certificate Manager (ACM) Private Certificate Authority. This solution enables an accelerated migration path by using your current third-party CA infrastructure. This removes the complexity and operational overhead of certificate re-issuance while maintaining the security posture that you’ve established with your existing mTLS implementation.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;This approach involves four key steps to reuse your existing client certificates when migrating to Amazon MSK:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;1. Create an Intermediate Certificate Using Your Third-Party CA&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;First, you generate an intermediate certificate authority (CA) certificate using your existing third-party CA infrastructure. This intermediate certificate acts as a bridge between your current certificate management system and AWS.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;2. Import the Intermediate Certificate into AWS Certificate Manager as a Private CA&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Next, you import this intermediate certificate into AWS Certificate Manager (ACM) as a Private Certificate Authority (PCA). This step establishes the intermediate CA within the AWS environment, making it recognizable to AWS services.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;3. Integrate Amazon MSK with the PCA created from your Intermediate Certificate&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;You then configure your Amazon MSK cluster to use the ACM Private CA that contains your imported intermediate certificate. This integration enables Amazon MSK to recognize and trust certificates signed by your certificate authority.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;4. Establish trust through common Certificate Authority&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;This approach works because both the AWS Private CA and your existing client certificates share the same root of trust: they’re both signed by your third-party CA. When Amazon MSK validates client certificates, it can trace the certificate chain back through the intermediate certificate in AWS Private CA to your trusted third-party CA, establishing a complete chain of trust without requiring certificate reissuance. This solution maintains your existing security architecture while enabling seamless migration to Amazon MSK, so your clients can continue using their current certificates without interruption.&lt;/p&gt; 
&lt;div id="attachment_90636" style="width: 1100px" class="wp-caption aligncenter"&gt;
 &lt;img aria-describedby="caption-attachment-90636" loading="lazy" class="wp-image-90636 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-1-10.png" alt="" width="1090" height="668"&gt;
 &lt;p id="caption-attachment-90636" class="wp-caption-text"&gt;Figure 1: Architecture diagram showing the integration of third-party Certificate Authority with Amazon MSK through AWS Certificate Manager Private CA&lt;/p&gt;
&lt;/div&gt; 
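&lt;p&gt;Once the subordinate CA certificate has been issued, the shared root of trust can be checked locally. The following is a minimal sketch, assuming OpenSSL 1.1.0 or later is installed and that the client certificate is issued directly by the root CA, as in the sample setup. The function name and the file names in the usage comment are hypothetical. A passing check only shows that both certificates chain to the same root; it doesn’t prove that the cluster will accept the client.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Sketch: confirm that a subordinate CA certificate and a client certificate
# both validate against the same root CA certificate.
# The function name and example paths are hypothetical.

# verify_shared_root ROOT_CA_CERT SUBORDINATE_CA_CERT CLIENT_CERT
# Returns 0 only if both certificates validate against ROOT_CA_CERT.
verify_shared_root() {
  local root_cert="$1" sub_ca_cert="$2" client_cert="$3" status=0

  # openssl verify exits non-zero when a certificate does not chain to the CA file.
  if ! openssl verify -CAfile "$root_cert" "$sub_ca_cert" >/dev/null 2>/dev/null; then
    echo "subordinate CA certificate does not chain to $root_cert"
    status=1
  fi
  if ! openssl verify -CAfile "$root_cert" "$client_cert" >/dev/null 2>/dev/null; then
    echo "client certificate does not chain to $root_cert"
    status=1
  fi
  return "$status"
}

# Example (hypothetical file names):
# verify_shared_root root-ca.crt acm-subordinate-ca-cert.pem kafka-admin.crt
```

&lt;p&gt;If your client certificates are issued by an intermediate CA of your own rather than by the root, the client check would also need that intermediate certificate (for example, through the &lt;code&gt;-untrusted&lt;/code&gt; option of &lt;code&gt;openssl verify&lt;/code&gt;).&lt;/p&gt;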
&lt;h2&gt;Implementation steps&lt;/h2&gt; 
&lt;p&gt;In real-world scenarios, you already have a certificate authority that has issued certificates for your clients. For the purpose of this post, we use a &lt;a href="https://github.com/aws-samples/msk-third-party-mtls" target="_blank" rel="noopener noreferrer"&gt;code sample&lt;/a&gt; to create a self-signed certificate authority (using OpenSSL) to demonstrate the implementation steps. If you already have an existing certificate authority, you don’t need to create a root CA. You can generate an intermediate CA (Step 2) using your third-party CA and continue following the steps from where you import the intermediate CA certificate into AWS ACM as a Private Certificate Authority.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Step 1: Create a root Certificate Authority using OpenSSL&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;&lt;strong&gt;Cloning the repository&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;To clone the repository, complete the following steps:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Clone the repository&lt;/strong&gt; using the following command:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;code&gt;git clone https://github.com/aws-samples/msk-third-party-mtls&lt;/code&gt;&lt;/p&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;&lt;strong&gt;Change to the repository’s &lt;code&gt;openssl&lt;/code&gt; directory&lt;/strong&gt;:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;code&gt;cd ./msk-third-party-mtls/openssl&lt;/code&gt;&lt;/p&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;&lt;strong&gt;Run the setup script&lt;/strong&gt;:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Make the scripts executable, and then run the setup script:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-shell"&gt;chmod +x *.sh
./setup-ca.sh&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;You’re prompted to set a password for the private key and the certificate. The following is example output:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90635 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-2-8.png" alt="" width="1020" height="589"&gt;&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Step 2: Create an intermediate CA for AWS ACM&lt;/strong&gt;&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;In the AWS Private CA console, create a subordinate CA.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90633 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-3-7.png" alt="" width="824" height="510"&gt;&lt;/p&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;Enter distinguished name information matching your organization, select a key algorithm, and choose &lt;strong&gt;Create CA&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;From the &lt;strong&gt;Actions&lt;/strong&gt; menu, select &lt;strong&gt;Install CA certificate&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Download the Certificate Signing Request (CSR) file provided by AWS Private CA.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90632 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-4-7.png" alt="" width="1432" height="573"&gt;&lt;/p&gt; 
&lt;ol start="5"&gt; 
 &lt;li&gt;Save the CSR file to your local directory (“certs”) as “CSR.pem”.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90631 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-5-7.png" alt="" width="1251" height="895"&gt;&lt;/p&gt; 
&lt;ol start="6"&gt; 
 &lt;li&gt;Sign the CSR issued by AWS Private CA with your root CA using the &lt;code&gt;./sign-acm-ca.sh&lt;/code&gt; script provided in the code sample.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90630 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-6-4.png" alt="" width="1126" height="497"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; AWS Private CA retains the private key internally. You only sign its CSR and import the resulting certificate back into AWS Private CA.&lt;/p&gt; 
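&lt;p&gt;The signing step can be sketched with plain OpenSSL. This is not the content of &lt;code&gt;sign-acm-ca.sh&lt;/code&gt;, which this post doesn’t show; it’s a minimal illustration, assuming your root CA certificate and key are available as files. The function name, the extension section name, the &lt;code&gt;pathlen:0&lt;/code&gt; constraint, the key usage values, and the default validity period are all assumptions chosen for illustration.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Sketch only: sign a subordinate CA CSR with an existing root CA using OpenSSL.
# This is NOT the sample repository's sign-acm-ca.sh; all names are hypothetical.

# sign_subordinate_csr CSR ROOT_CERT ROOT_KEY OUT_CERT [DAYS]
sign_subordinate_csr() {
  local csr="$1" root_cert="$2" root_key="$3" out_cert="$4" days="${5:-365}"
  local ext_file status
  ext_file="$(mktemp)"

  # The issued certificate must be marked as a CA certificate.
  # pathlen:0 (an assumption) stops this CA from issuing further CA certificates.
  printf '%s\n' \
    '[v3_subordinate_ca]' \
    'basicConstraints = critical,CA:TRUE,pathlen:0' \
    'keyUsage = critical,digitalSignature,keyCertSign,cRLSign' > "$ext_file"

  # -CAcreateserial writes a serial number file next to the root certificate.
  if openssl x509 -req -in "$csr" \
      -CA "$root_cert" -CAkey "$root_key" -CAcreateserial \
      -days "$days" -sha256 \
      -extfile "$ext_file" -extensions v3_subordinate_ca \
      -out "$out_cert"; then
    status=0
  else
    status=1
  fi
  rm -f "$ext_file"
  return "$status"
}

# Example (hypothetical paths; OpenSSL prompts for the key password if the key is encrypted):
# sign_subordinate_csr certs/CSR.pem certs/root-ca.crt private/root-ca.key certs/acm-subordinate-ca-cert.pem
```

&lt;p&gt;Whatever tool signs the CSR, the result has to be a CA certificate (basic constraints with CA set to true); an end-entity certificate can’t act as a subordinate CA.&lt;/p&gt;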
&lt;h3&gt;&lt;strong&gt;Step 3: Import signed certificate to AWS ACM Private CA&lt;/strong&gt;&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Go back to the AWS Private CA console.&lt;/li&gt; 
 &lt;li&gt;Select the CA that you created and select &lt;strong&gt;Install CA certificate&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="wp-image-90629 size-full alignnone" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-7-4.png" alt="" width="330" height="366"&gt;&lt;/p&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;Select &lt;strong&gt;External private CA&lt;/strong&gt; as the CA type.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90628 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-8-4.png" alt="" width="1433" height="192"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Importing the certificate into AWS Certificate Manager&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90627 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-9-3.png" alt="" width="819" height="671"&gt;&lt;/p&gt; 
&lt;p&gt;Open both files in a text editor:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;code&gt;acm-subordinate-ca-cert.pem&lt;/code&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;acm-ca-chain.pem&lt;/code&gt;&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Do the following in the &lt;strong&gt;Certificate body&lt;/strong&gt; field in AWS ACM:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Copy the &lt;strong&gt;entire content&lt;/strong&gt; from the &lt;code&gt;acm-subordinate-ca-cert.pem&lt;/code&gt; file and paste it into the text box.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Do the following in the &lt;strong&gt;Certificate chain&lt;/strong&gt; field in AWS ACM:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Open the &lt;code&gt;acm-ca-chain.pem&lt;/code&gt; file. This file contains &lt;strong&gt;one certificate&lt;/strong&gt; (the root CA certificate).&lt;/li&gt; 
 &lt;li&gt;Copy the &lt;strong&gt;root CA certificate&lt;/strong&gt; portion and paste it into the text box.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; The certificate chain shouldn’t include the subordinate CA certificate itself—only the certificates above it in the chain (the root CA).&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Confirm and install&lt;/strong&gt; to complete the process.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The AWS Private CA should now show an active status.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90626 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-10-3.png" alt="" width="1430" height="367"&gt;&lt;/p&gt; 
&lt;h3&gt;Step 4: Configure your MSK cluster for Mutual TLS authentication&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Select your MSK cluster, go to&amp;nbsp;&lt;strong&gt;Properties&lt;/strong&gt;&amp;nbsp;and edit the&amp;nbsp;&lt;strong&gt;Security settings&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Select&amp;nbsp;&lt;strong&gt;TLS client authentication through AWS Certificate Manager (ACM)&lt;/strong&gt;&amp;nbsp;as the access control method and choose the Subordinate CA that you created earlier. Then choose &lt;strong&gt;Save changes&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90625 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-11-3.png" alt="" width="798" height="485"&gt;&lt;/p&gt; 
&lt;h3&gt;Step 5: Test your client&lt;/h3&gt; 
&lt;p&gt;&lt;strong&gt;Run the certificate generation script&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Execute the following command, replacing &amp;lt;client-name&amp;gt; with a descriptive name for your client (this will be used in the certificate filename): &lt;code&gt;./generate-client-cert.sh &amp;lt;client-name&amp;gt;&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;./generate-client-cert.sh kafka-admin&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Enter distinguished name information&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;When prompted, enter the distinguished name (DN) options. These should &lt;strong&gt;match your root CA settings&lt;/strong&gt; except for the Common Name (CN):&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Country (C):&lt;/strong&gt; Match your root CA (for example, US)&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;State (ST):&lt;/strong&gt; Match your root CA (for example, State)&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Organization (O):&lt;/strong&gt; Match your root CA (for example, Anycompany)&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Organizational Unit (OU):&lt;/strong&gt; Match your root CA (for example, IT)&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Common Name (CN):&lt;/strong&gt; Use a &lt;strong&gt;client-specific identifier&lt;/strong&gt; (for example, kafka-admin or client)&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Verify certificate files&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;After the certificate is generated, verify that the files were created successfully by running &lt;code&gt;ls ~/ca/certs&lt;/code&gt;. You should see files with your client name, including:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;code&gt;&amp;lt;client-name&amp;gt;.key&lt;/code&gt; (private key)&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;&amp;lt;client-name&amp;gt;.crt&lt;/code&gt; (certificate)&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;&amp;lt;client-name&amp;gt;.p12&lt;/code&gt; (PKCS12 keystore)&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Create Kafka client properties file&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Create a new properties file for your Kafka client (for example, &lt;code&gt;kafka-tls-client.properties&lt;/code&gt;) based on the provided &lt;code&gt;kafka-admin-ssl.properties&lt;/code&gt; example file. Update the file paths to reference your newly generated client certificate files.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Example configuration:&lt;/strong&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;security.protocol=SSL
ssl.keystore.location=/path/to/&amp;lt;client-name&amp;gt;.p12
ssl.keystore.password=your-keystore-password
# Omit ssl.key.password if you didn't set a key password
ssl.key.password=your-key-password
ssl.keystore.alias=your-private-key-alias&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
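&lt;p&gt;A quick check of the properties file can catch two common mistakes before you connect. The following is a minimal sketch; the function name and messages are illustrative and aren’t part of Kafka or the sample repository. It relies on one property of the Java properties format: &lt;code&gt;#&lt;/code&gt; starts a comment only when it’s the first non-blank character on a line, so text such as &lt;code&gt;#note&lt;/code&gt; placed after a value becomes part of that value. The inline-comment check is a heuristic and would also flag a real value that contains a space followed by &lt;code&gt;#&lt;/code&gt;.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Sketch: sanity-check a Kafka client properties file before connecting.
# Function name and messages are illustrative.

# check_client_properties FILE
# Returns 0 if no problem is found, 1 otherwise.
check_client_properties() {
  local file="$1" problems=0 keystore

  # In a Java properties file, "#" starts a comment only at the beginning of a line.
  # A "#..." after a value is read as part of the value (for example, part of a password).
  if grep -nE '^[[:space:]]*[^#![:space:]][^=]*=.*[[:space:]]#' "$file"; then
    echo "possible inline comment: move it to its own line"
    problems=1
  fi

  # The keystore path must point at a readable file.
  keystore="$(sed -n 's/^ssl\.keystore\.location=//p' "$file")"
  if [ -z "$keystore" ]; then
    echo "ssl.keystore.location is not set"
    problems=1
  elif [ ! -r "$keystore" ]; then
    echo "keystore not readable: $keystore"
    problems=1
  fi

  return "$problems"
}

# Example: check_client_properties kafka-tls-client.properties
```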
&lt;h3&gt;Step 6: Testing the Kafka client connection&lt;/h3&gt; 
&lt;p&gt;To test the Kafka client connection, do the following.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Set environment variables&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;First, set the required environment variables for your Kafka installation and MSK cluster:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-javascript"&gt;export KAFKA_HOME=/home/ec2-user/kafka
export BOOTSTRAP_SERVERS=&amp;lt;your-msk-bootstrap-servers&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Replace &amp;lt;your-msk-bootstrap-servers&amp;gt; with your actual Amazon MSK cluster bootstrap server endpoints (for example, b-1.mycluster.abc123.kafka.us-east-1.amazonaws.com:9094,b-2.mycluster.abc123.kafka.us-east-1.amazonaws.com:9094).&lt;/p&gt; 
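&lt;p&gt;Because a wrong port in the bootstrap string is an easy mistake to make, a small check can list each endpoint and flag any that don’t use the expected port. The following is a minimal sketch; the function name is illustrative, and the default of 9094 is taken from the example endpoints above. Pass a different expected port if your cluster exposes its TLS listener elsewhere.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Sketch: list the brokers in a bootstrap string and flag unexpected ports.
# The function name is illustrative; 9094 matches the example endpoints in this post.

# check_bootstrap_ports BOOTSTRAP_SERVERS [EXPECTED_PORT]
# Prints one line per endpoint and returns 1 if any endpoint uses another port.
check_bootstrap_ports() {
  local servers="$1" expected="${2:-9094}" status=0 endpoint port
  local old_ifs="$IFS"

  IFS=','                          # split the comma-separated endpoint list
  for endpoint in $servers; do
    port="${endpoint##*:}"         # text after the last colon
    if [ "$port" = "$expected" ]; then
      echo "ok     $endpoint"
    else
      echo "check  $endpoint (expected port $expected)"
      status=1
    fi
  done
  IFS="$old_ifs"

  return "$status"
}

# Example: check_bootstrap_ports "$BOOTSTRAP_SERVERS"
```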
&lt;p&gt;&lt;strong&gt;Run the Kafka list topics command&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Execute the following command to verify that your client can successfully connect to Amazon MSK using mutual TLS authentication:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;$KAFKA_HOME/bin/kafka-topics.sh \
  --bootstrap-server $BOOTSTRAP_SERVERS \
  --list \
  --command-config kafka-tls-client.properties&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;What this test does:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Connects to your Amazon MSK cluster using the TLS configuration in your properties file&lt;/li&gt; 
 &lt;li&gt;Authenticates using your client certificate&lt;/li&gt; 
 &lt;li&gt;Lists all available Kafka topics&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Expected result:&lt;/strong&gt; If successful, you should see a list of topics in your Kafka cluster (or an empty list if no topics exist yet).&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90624 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-12-2.png" alt="" width="1284" height="243"&gt;&lt;/p&gt; 
&lt;p&gt;If the connection fails, check:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Your bootstrap server endpoints are correct&lt;/li&gt; 
 &lt;li&gt;You imported the private key and certificate chain into your keystore&lt;/li&gt; 
 &lt;li&gt;The paths in your properties file point to the correct keystore and truststore files&lt;/li&gt; 
 &lt;li&gt;Your client certificate was properly imported&lt;/li&gt; 
 &lt;li&gt;Your Amazon MSK cluster security settings allow TLS client authentication&lt;/li&gt; 
 &lt;li&gt;Your Amazon MSK cluster references the correct private CA ARN&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Troubleshooting&lt;/h2&gt; 
&lt;h3&gt;Enable debug mode to verify certificate handshake&lt;/h3&gt; 
&lt;p&gt;To troubleshoot certificate issues and verify which certificates are involved in the TLS handshake, enable Java SSL debug mode:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-javascript"&gt;export KAFKA_OPTS="-Djavax.net.debug=ssl:handshake:verbose"
$KAFKA_HOME/bin/kafka-topics.sh \
  --bootstrap-server $BOOTSTRAP_SERVERS \
  --list \
  --command-config kafka-tls-client.properties&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;What this debug mode shows:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;The complete TLS handshake process&lt;/li&gt; 
 &lt;li&gt;Which certificates are being presented by both client and server&lt;/li&gt; 
 &lt;li&gt;The certificate chain validation steps&lt;/li&gt; 
 &lt;li&gt;Which certificate from your truststore is being used for authentication&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90623 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-13-3.png" alt="" width="1270" height="474"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;When this is helpful:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;When you have multiple certificates in your truststore and need to identify which one is being used&lt;/li&gt; 
 &lt;li&gt;When troubleshooting certificate chain validation issues&lt;/li&gt; 
 &lt;li&gt;When verifying that the correct client certificate is being presented during authentication&lt;/li&gt; 
 &lt;li&gt;When diagnosing certificate mismatch or trust issues&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;&lt;strong&gt;Reading the debug output:&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Look for lines containing:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;***Certificate chain – Shows the certificates being presented&lt;/li&gt; 
 &lt;li&gt;Found trusted certificate – Indicates which certificate in your truststore matched&lt;/li&gt; 
 &lt;li&gt;Cert path validation – Shows the certificate chain validation process&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;To disable debug mode&lt;/strong&gt; after troubleshooting, simply unset the environment variable:&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;unset KAFKA_OPTS&lt;/code&gt;&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;This post presents a solution for migrating TLS clients from self-managed Apache Kafka to Amazon MSK while reusing existing third-party CA-signed certificates. The approach removes the need for certificate reissuance by instead creating an intermediate CA from the existing third-party CA, importing it into AWS Certificate Manager as a Private CA, and integrating it with Amazon MSK. This maintains the established chain of trust through the common certificate authority, enabling seamless migration without operational disruption while preserving the existing security architecture and mTLS implementation. To read more about the Amazon MSK security model, see &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/security.html" target="_blank" rel="noopener noreferrer"&gt;Security in Amazon MSK&lt;/a&gt;.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-82865 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/08/31/BDB-4572-Ali-Alemi.png" alt="Author Ali Alemi" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Ali Alemi&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/ali-alemi-11869b53/" target="_blank" rel="noopener"&gt;Ali&lt;/a&gt; is a Principal Streaming Solutions Architect at AWS. Ali advises AWS customers on architectural best practices and helps them design real-time analytics data systems that are reliable, secure, efficient, and cost-effective. Prior to joining AWS, Ali supported several public sector customers and AWS consulting partners in their application modernization journey and migration to the Cloud.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-62362 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2024/04/10/swapnaba-pic-2.jpeg" alt="" width="81" height="108"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Swapna Bandla&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/swapnabandla/" target="_blank" rel="noopener"&gt;Swapna&lt;/a&gt; is a Senior Streaming Solutions Architect at AWS. With a deep understanding of real-time data processing and analytics, she partners with customers to architect scalable, cloud-native solutions that align with AWS Well-Architected best practices. Swapna is passionate about helping organizations unlock the full potential of their data to drive business value. Beyond her professional pursuits, she cherishes quality time with her family.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>A guide to capacity planning for Airflow worker pool in Amazon MWAA</title>
		<link>https://aws.amazon.com/blogs/big-data/a-guide-to-capacity-planning-for-airflow-worker-pool-in-amazon-mwaa/</link>
		
		<dc:creator><![CDATA[Boyko Radulov]]></dc:creator>
		<pubDate>Fri, 01 May 2026 15:42:45 +0000</pubDate>
				<category><![CDATA[Amazon Managed Workflows for Apache Airflow (Amazon MWAA)]]></category>
		<category><![CDATA[Best Practices]]></category>
		<category><![CDATA[Amazon Cloudwatch]]></category>
		<guid isPermaLink="false">250b8508de241c170b26475c4624ddc662cfa423</guid>

					<description>In our previous post, A guide to Airflow worker pool optimization in Amazon MWAA, we explored when adding workers to your Amazon Managed Workflows for Apache Airflow (Amazon MWAA) environment actually solves performance issues, and when it doesn’t. We walked through patterns like high CPU utilization and long queue times where scaling may be appropriate, […]</description>
										<content:encoded>&lt;p&gt;In our previous post, &lt;a href="https://aws.amazon.com/blogs/big-data/a-guide-to-airflow-worker-pool-optimization-in-amazon-mwaa/"&gt;A guide to Airflow worker pool optimization in Amazon MWAA&lt;/a&gt;, we explored when adding workers to your Amazon Managed Workflows for Apache Airflow (Amazon MWAA) environment actually solves performance issues, and when it doesn’t. We walked through patterns like high CPU utilization and long queue times where scaling may be appropriate, and anti-patterns like misconfigured Airflow settings and memory leaks where adding workers only masks the real problem. The key takeaway was clear: optimize first, scale second, and always let data drive the decision.&lt;/p&gt; 
&lt;p&gt;But what happens after you’ve done the optimization work? Your DAGs are efficient, your configurations are tuned, and your environment is running well. Then the business comes knocking: new regulatory requirements, additional data pipelines, expanded reporting. The workload is about to grow, and this time, you genuinely need more capacity.&lt;/p&gt; 
&lt;p&gt;This is where capacity planning comes in. Knowing how many workers to provision, before the new workload hits production, is the difference between a smooth rollout and a 5 AM SLA breach. In this post, we walk through a practical capacity planning framework for Amazon MWAA worker pools. Using a real-world financial services scenario, we show how to assess your current capacity, project future needs, calculate the right number of base workers, and set up monitoring to keep your environment healthy as workloads evolve.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; A financial services company needs to plan capacity for a 25% directed acyclic graph (DAG) increase to support new regulatory reporting requirements.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Current vs projected state&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;The following table compares the current and expected state after adding 25% more DAGs.&lt;/p&gt; 
&lt;p&gt;&amp;nbsp;&lt;/p&gt; 
&lt;table border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;/td&gt; 
   &lt;td&gt;Metric&lt;/td&gt; 
   &lt;td&gt;Current&lt;/td&gt; 
   &lt;td&gt;Projected&lt;/td&gt; 
   &lt;td&gt;Change&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;1&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;DAGs&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;20&lt;/td&gt; 
   &lt;td&gt;25&lt;/td&gt; 
   &lt;td&gt;+25%&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;2&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Peak Tasks (5-7 AM)&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;80&lt;/td&gt; 
   &lt;td&gt;104&lt;/td&gt; 
   &lt;td&gt;+24 tasks&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;3&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Environment Class&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;mw1.medium&lt;/td&gt; 
   &lt;td&gt;mw1.medium&lt;/td&gt; 
   &lt;td&gt;No change&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;4&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Base Workers&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;8&lt;/td&gt; 
   &lt;td&gt;11&lt;/td&gt; 
   &lt;td&gt;+3 workers&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;5&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Tasks per Worker&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;10 (mw1.medium default)&lt;/td&gt; 
   &lt;td&gt;10&lt;/td&gt; 
   &lt;td&gt;No change&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;6&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Available Capacity&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;80 slots (8 × 10)&lt;/td&gt; 
   &lt;td&gt;110 slots (11 × 10)&lt;/td&gt; 
   &lt;td&gt;+30 slots&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;7&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Peak Utilization&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;100% (80/80 slots) &lt;img src="https://s.w.org/images/core/emoji/14.0.0/72x72/26a0.png" alt="⚠" class="wp-smiley" style="height: 1em; max-height: 1em;"&gt;&lt;/td&gt; 
   &lt;td&gt;95% (104/110 slots)&lt;/td&gt; 
   &lt;td&gt;Improved&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;8&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Critical SLA&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;7 AM market open&lt;/td&gt; 
   &lt;td&gt;7 AM market open&lt;/td&gt; 
   &lt;td&gt;No tolerance&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;&lt;strong&gt;Capacity planning goal:&lt;/strong&gt; Reduce utilization from 100% to 95% to maintain service level agreement (SLA) compliance and handle unexpected spikes.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Understanding current capacity:&lt;/strong&gt; The environment currently runs 8 base workers, providing 80 concurrent task slots (8 workers × 10 tasks per worker). During the 5-7 AM peak with 80 concurrent tasks, this represents 100% utilization, a risky level that leaves no headroom for unexpected spikes or volatility.&lt;br&gt; With the planned addition of 5 new regulatory reporting DAGs, peak concurrent tasks will grow to 104. To maintain healthy operations with adequate buffer, we need to increase to 11 base workers (110 slots), resulting in 95% peak utilization with 6 slots of breathing room.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Why 100% utilization is risky: &lt;/strong&gt;Running at 100% task utilization means:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Zero buffer for unexpected spikes&lt;/li&gt; 
 &lt;li&gt;Any additional task causes immediate queuing&lt;/li&gt; 
 &lt;li&gt;No room for market volatility or data volume increases&lt;/li&gt; 
 &lt;li&gt;High risk of SLA breaches during unpredictable events&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Best practice: Maintain at least 5-15% headroom (85-95% utilization) for production workloads with critical SLAs.&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Why this sizing:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Current:&lt;/strong&gt; 80 tasks ÷ 80 slots = 100% utilization (at capacity – risky!)&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Projected:&lt;/strong&gt; 104 tasks ÷ 110 slots = 95% utilization (healthy with buffer)&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Buffer:&lt;/strong&gt; 6 slots (5% headroom) protects against unexpected volatility spikes&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;SLA protection:&lt;/strong&gt; Adequate headroom prevents queuing during normal operations&lt;/li&gt; 
&lt;/ul&gt; 
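&lt;p&gt;The utilization figures above can be reproduced with a few lines of integer arithmetic. The following is a minimal sketch; the function names are illustrative. Utilization is rounded to the nearest whole percent, which is why 104 tasks on 110 slots (94.5%) is reported as 95%.&lt;/p&gt;

```shell
#!/usr/bin/env bash
# Sketch: reproduce the utilization figures from the sizing table.
# Function names are illustrative.

# total_slots WORKERS TASKS_PER_WORKER
total_slots() {
  echo $(( $1 * $2 ))
}

# utilization_pct PEAK_TASKS WORKERS TASKS_PER_WORKER
# Prints peak utilization as a percentage rounded to the nearest integer.
utilization_pct() {
  local slots
  slots="$(total_slots "$2" "$3")"
  echo $(( ($1 * 100 + slots / 2) / slots ))
}

# headroom_slots PEAK_TASKS WORKERS TASKS_PER_WORKER
# Prints the number of free task slots at peak.
headroom_slots() {
  echo $(( $(total_slots "$2" "$3") - $1 ))
}

utilization_pct 80 8 10     # current state: prints 100
utilization_pct 104 11 10   # projected state: prints 95
headroom_slots 104 11 10    # projected buffer: prints 6
```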
&lt;h2&gt;&lt;strong&gt;Capacity analysis&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Every team asks the same critical question: &lt;strong&gt;“How many workers do I need?”&lt;/strong&gt; The process is to identify your peak concurrent tasks from &lt;a href="https://docs.aws.amazon.com/mwaa/latest/userguide/accessing-metrics-cw-container-queue-db.html" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch metrics&lt;/a&gt;, divide by your environment’s tasks-per-worker capacity, and add a 5%-15% safety buffer.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Step 1: Identifying peak concurrent tasks from Amazon CloudWatch&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;To determine your peak workload, you need to analyze RunningTasks and QueuedTasks CloudWatch metrics for your Amazon MWAA environment. Navigate to Amazon CloudWatch and query the following key metrics:&lt;/p&gt; 
&lt;h4&gt;&lt;strong&gt;Primary metrics for capacity planning:&lt;/strong&gt;&lt;/h4&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;RunningTasks:&lt;/strong&gt; Number of tasks currently executing across all workers. This shows your actual concurrent task load.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;QueuedTasks:&lt;/strong&gt; Number of tasks waiting for available worker slots. High values indicate insufficient capacity.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;AvailableWorkers:&lt;/strong&gt; Current number of active workers in your environment.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;How to find peak concurrent tasks:&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Open the Amazon CloudWatch Console. 
  &lt;ul&gt; 
   &lt;li&gt;Choose &lt;strong&gt;Metrics&lt;/strong&gt;.&lt;/li&gt; 
   &lt;li&gt;Choose the &lt;strong&gt;MWAA&lt;/strong&gt; namespace.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;Select your environment name.&lt;/li&gt; 
 &lt;li&gt;Add the &lt;code&gt;RunningTasks&lt;/code&gt; metric.&lt;/li&gt; 
 &lt;li&gt;Set the time range to the last 7-30 days.&lt;/li&gt; 
 &lt;li&gt;Change the statistic to &lt;strong&gt;Maximum&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Identify the highest value during your peak hours (for example, 5-7 AM).&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;strong&gt;Example query:&lt;/strong&gt;&lt;br&gt; Note: The following query is conceptual and does not translate directly to Amazon CloudWatch query syntax. For the actual syntax, refer to &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/query_with_cloudwatch-metrics-insights.html" target="_blank" rel="noopener noreferrer"&gt;Query your CloudWatch metrics with CloudWatch Metrics Insights&lt;/a&gt;.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;SELECT MAX(RunningTasks) AS PeakConcurrentTasks
FROM MWAA_Metrics
WHERE Environment = 'prod-airflow'
  AND timestamp BETWEEN '2024-10-01' AND '2024-10-31'
  AND HOUR(timestamp) BETWEEN 5 AND 7;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
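&lt;p&gt;If you export the datapoints instead of reading them from the console graph, the same peak-window analysis can be sketched in Python. This is a minimal illustration that assumes you already have &lt;code&gt;(timestamp, value)&lt;/code&gt; pairs for &lt;code&gt;RunningTasks&lt;/code&gt;. The function name and the sample values are hypothetical.&lt;/p&gt;

```python
from datetime import datetime


def peak_concurrent_tasks(datapoints, start_hour=5, end_hour=7):
    """Return the highest RunningTasks value seen inside the daily peak window.

    datapoints: iterable of (timestamp, value) pairs, for example exported
    from CloudWatch with the Maximum statistic.
    The window covers start_hour up to, but not including, end_hour.
    """
    window = range(start_hour, end_hour)
    in_window = [value for ts, value in datapoints if ts.hour in window]
    return max(in_window, default=0)


# Hypothetical sample data, not real measurements.
samples = [
    (datetime(2024, 10, 1, 4, 55), 35),
    (datetime(2024, 10, 1, 5, 30), 72),
    (datetime(2024, 10, 1, 6, 15), 80),
    (datetime(2024, 10, 1, 7, 5), 41),
]
print(peak_concurrent_tasks(samples))
```

&lt;p&gt;Only the two samples between 5 AM and 7 AM are considered, so datapoints outside the peak window do not affect the result.&lt;/p&gt;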
&lt;p&gt;In our scenario, this analysis revealed &lt;strong&gt;80 concurrent tasks&lt;/strong&gt; during the 5-7 AM window. With the planned 25% DAG increase, we project this will grow to &lt;strong&gt;104 concurrent tasks&lt;/strong&gt;.&lt;/p&gt; 
&lt;h3&gt;Step 2: Calculate required workers&lt;/h3&gt; 
&lt;p&gt;To calculate the number of workers required to avoid queuing any tasks, use the following formula and round the result up to the next whole worker: &lt;strong&gt;Peak concurrent tasks ÷ Tasks per worker × Safety buffer = Required workers&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;In the projected scenario with 104 tasks at peak hours, an mw1.medium environment with the default concurrency configuration, and a 5% safety buffer, we need 11 workers.&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;104 peak tasks ÷ 10 tasks per worker × 1.05 buffer = 10.92, rounded up to 11 workers required to handle your workload without queuing during the busiest periods.&lt;/li&gt; 
&lt;/ul&gt; 
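&lt;p&gt;The worker calculation can be expressed as a small Python helper. This is a sketch under stated assumptions: the function name is illustrative, the buffer is passed as a fraction (0.05 for the 5% buffer in the scenario), and the result is rounded up because a fractional worker cannot be provisioned.&lt;/p&gt;

```python
import math


def required_workers(peak_tasks, tasks_per_worker, safety_buffer=0.05):
    """Workers needed to run peak_tasks concurrently with a fractional buffer."""
    raw = peak_tasks / tasks_per_worker * (1 + safety_buffer)
    # Round up: a fractional worker cannot be provisioned.
    return math.ceil(raw)


# 104 peak tasks on mw1.medium (10 tasks per worker) with a 5% buffer.
print(required_workers(104, 10, 0.05))
```

&lt;p&gt;Raising the buffer toward the upper end of the 5-15% range increases the result, so it is worth evaluating both ends of the range before you commit to a base worker count.&lt;/p&gt;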
&lt;h2&gt;Capacity monitoring and triggers&lt;/h2&gt; 
&lt;p&gt;There are a few important Amazon CloudWatch metrics to monitor for environment health.&lt;/p&gt; 
&lt;h3&gt;Key metrics to monitor&lt;/h3&gt; 
&lt;p&gt;Monitor these five critical Amazon CloudWatch metrics to detect capacity issues:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;QueuedTasks (&amp;gt;10 for &amp;gt;5 minutes indicates insufficient capacity)&lt;/li&gt; 
 &lt;li&gt;RunningTasks (consistently at maximum suggests the need for more workers)&lt;/li&gt; 
 &lt;li&gt;AdditionalWorkers (active for more than 6 hours daily signals the permanent worker problem)&lt;/li&gt; 
 &lt;li&gt;Worker CPU (&amp;gt;85% sustained requires environment class upgrade or workload optimization)&lt;/li&gt; 
 &lt;li&gt;Task Duration (+15% increase means reduced effective capacity per worker)&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;These metrics provide early warning signals to adjust capacity before SLA breaches occur.&lt;/p&gt; 
&lt;p&gt;&amp;nbsp;&lt;/p&gt; 
&lt;table border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;/td&gt; 
   &lt;td&gt;Metric&lt;/td&gt; 
   &lt;td&gt;Threshold&lt;/td&gt; 
   &lt;td&gt;Action&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;1&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;QueuedTasks&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&amp;gt;10 for &amp;gt;5 minutes&lt;/td&gt; 
   &lt;td&gt;Investigate capacity&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;2&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;RunningTasks&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;Consistently at max&lt;/td&gt; 
   &lt;td&gt;Increase base workers&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;3&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;AdditionalWorkers&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;Active &amp;gt;6 hours daily&lt;/td&gt; 
   &lt;td&gt;Increase base workers&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;4&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Worker CPU&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&amp;gt;85% sustained&lt;/td&gt; 
   &lt;td&gt;Upgrade environment class&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;5&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Task Duration&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;+15% increase&lt;/td&gt; 
   &lt;td&gt;Review capacity per worker&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;h3&gt;Amazon CloudWatch monitoring queries&lt;/h3&gt; 
&lt;p&gt;Note: The following queries are conceptual and do not translate directly to Amazon CloudWatch query syntax. For the actual syntax, refer to &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/query_with_cloudwatch-metrics-insights.html" target="_blank" rel="noopener noreferrer"&gt;Query your CloudWatch metrics with CloudWatch Metrics Insights&lt;/a&gt;.&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Queue depth during peak hours 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="lang-sql"&gt;SELECT AVG(QueuedTasks)
FROM MWAA_Metrics
WHERE Environment = 'prod-airflow'
  AND timestamp BETWEEN '05:00' AND '07:00'
GROUP BY 5m;&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;/li&gt; 
 &lt;li&gt;Worker utilization efficiency 
  &lt;div class="hide-language"&gt; 
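   &lt;pre&gt;&lt;code class="lang-sql"&gt;-- 5 is the default tasks per worker on mw1.small; substitute your environment's value (10 in the mw1.medium scenario)
SELECT AVG(RunningTasks) / AVG(AvailableWorkers * 5) * 100 AS UtilizationPercent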
FROM MWAA_Metrics
WHERE Environment = 'prod-airflow';&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;/li&gt; 
 &lt;li&gt;Detect permanent worker problem 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="lang-sql"&gt;SELECT DATE(timestamp) AS date,
       AVG(AdditionalWorkers) AS avg_additional,
       MAX(AdditionalWorkers) AS max_additional
FROM MWAA_Metrics
WHERE AdditionalWorkers &amp;gt; 0
GROUP BY DATE(timestamp)
HAVING AVG(AdditionalWorkers) &amp;gt; 5;&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;/li&gt; 
&lt;/ul&gt; 
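&lt;p&gt;The permanent worker check relies on the “active more than 6 hours daily” threshold from the table above. The following Python sketch applies that threshold to exported &lt;code&gt;AdditionalWorkers&lt;/code&gt; samples. It assumes one sample per hour, and the function name and sample data are hypothetical.&lt;/p&gt;

```python
from collections import Counter
from datetime import datetime


def days_with_permanent_workers(hourly_samples, max_hours=6):
    """Return dates on which AdditionalWorkers was active for more than max_hours.

    hourly_samples: iterable of (timestamp, additional_workers) pairs,
    one sample per hour (an assumption of this sketch).
    """
    active_hours = Counter(
        ts.date() for ts, workers in hourly_samples if workers > 0
    )
    return sorted(day for day, hours in active_hours.items() if hours > max_hours)


# Hypothetical data: 8 active hours on October 1, 2 active hours on October 2.
samples = [(datetime(2024, 10, 1, h), 2) for h in range(5, 13)]
samples += [(datetime(2024, 10, 2, h), 1) for h in range(5, 7)]
print(days_with_permanent_workers(samples))
```

&lt;p&gt;Any date this returns is a day on which automatic scaling was acting as base capacity, which is the signal to increase base workers.&lt;/p&gt;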
&lt;h3&gt;&lt;strong&gt;Setting up alerts&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;You can configure the following alarms to surface capacity problems as soon as they appear.&lt;/p&gt; 
&lt;h4&gt;&lt;strong&gt;Recommended Amazon CloudWatch alarms:&lt;/strong&gt;&lt;/h4&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;High queue depth alert&lt;/strong&gt; 
  &lt;ul&gt; 
   &lt;li&gt;Metric: QueuedTasks&lt;/li&gt; 
   &lt;li&gt;Threshold: &amp;gt; 10 for 2 consecutive 5-minute periods&lt;/li&gt; 
   &lt;li&gt;Action: Notify operations team&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Permanent worker detection&lt;/strong&gt; 
  &lt;ul&gt; 
   &lt;li&gt;Metric: AdditionalWorkers&lt;/li&gt; 
   &lt;li&gt;Threshold: &amp;gt; 0 for 6+ hours&lt;/li&gt; 
   &lt;li&gt;Action: Review capacity planning&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;SLA risk alert&lt;/strong&gt; 
  &lt;ul&gt; 
   &lt;li&gt;Metric: QueuedTasks during 5-7 AM window&lt;/li&gt; 
   &lt;li&gt;Threshold: &amp;gt; 5 tasks&lt;/li&gt; 
   &lt;li&gt;Action: Page on-call engineer&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ol&gt; 
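&lt;p&gt;As an illustration, the high queue depth alert maps onto the parameters of the CloudWatch &lt;code&gt;put_metric_alarm&lt;/code&gt; API. The following Python sketch only builds the parameter dictionary. The namespace, the dimension name, the &lt;code&gt;Maximum&lt;/code&gt; statistic, the alarm name, and the SNS topic ARN are assumptions that you should verify against the metrics shown in your own CloudWatch console.&lt;/p&gt;

```python
def queue_depth_alarm_params(environment_name, sns_topic_arn):
    """Build keyword arguments for CloudWatch put_metric_alarm.

    Encodes the high queue depth alert: QueuedTasks above 10
    for 2 consecutive 5-minute periods.
    """
    return {
        "AlarmName": f"{environment_name}-high-queue-depth",
        "Namespace": "AWS/MWAA",  # assumption: confirm in your CloudWatch console
        "MetricName": "QueuedTasks",
        "Dimensions": [  # assumption: dimension names may differ for your metrics
            {"Name": "Environment", "Value": environment_name},
        ],
        "Statistic": "Maximum",  # assumption: alert on the worst value per period
        "Period": 300,           # 5 minutes, in seconds
        "EvaluationPeriods": 2,  # 2 consecutive periods
        "Threshold": 10,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # notifies the operations team
    }


params = queue_depth_alarm_params(
    "prod-airflow",
    "arn:aws:sns:us-east-1:123456789012:ops-alerts",  # placeholder ARN
)
# With boto3 installed and credentials configured, you could then call:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
print(params["Period"] * params["EvaluationPeriods"], "seconds of sustained breach")
```

&lt;p&gt;Keeping the parameters in a plain dictionary lets you review and version the thresholds before any alarm is created.&lt;/p&gt;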
&lt;h3&gt;&lt;strong&gt;When to revisit capacity planning&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Conduct quarterly scheduled reviews to analyze trends and project growth. Also run immediate trigger-based assessments when:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;DAG count increases &amp;gt;10% (or more than your safety buffer)&lt;/li&gt; 
 &lt;li&gt;Performance degrades&lt;/li&gt; 
 &lt;li&gt;Cost anomalies appear (indicating permanent workers)&lt;/li&gt; 
 &lt;li&gt;Any SLA breach occurs&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;This dual approach provides proactive capacity management while enabling rapid response to emerging issues.&lt;/p&gt; 
&lt;p&gt;&amp;nbsp;&lt;/p&gt; 
&lt;table border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;/td&gt; 
   &lt;td&gt;Trigger&lt;/td&gt; 
   &lt;td&gt;Frequency&lt;/td&gt; 
   &lt;td&gt;Action&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;1&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Scheduled Review&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;Quarterly&lt;/td&gt; 
   &lt;td&gt;Analyze trends, project growth&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;2&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;DAG Growth&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&amp;gt;10% increase&lt;/td&gt; 
   &lt;td&gt;Recalculate capacity needs&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;3&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Performance Degradation&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;As observed&lt;/td&gt; 
   &lt;td&gt;Immediate capacity assessment&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;4&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Cost Anomalies&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;Monthly&lt;/td&gt; 
   &lt;td&gt;Check for permanent workers&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;5&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;SLA Breaches&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;Any occurrence&lt;/td&gt; 
   &lt;td&gt;Emergency capacity review&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;h2&gt;&lt;strong&gt;Decision matrix&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;The framework presents three capacity planning approaches, each optimized for different organizational priorities.&lt;/p&gt; 
&lt;p&gt;The &lt;strong&gt;Full Base Worker Provisioning strategy&lt;/strong&gt; (the conservative path) sets base workers equal to the calculated requirement, eliminating queue times during peak periods and guaranteeing SLA compliance with predictable fixed costs, while automatic scaling handles only unexpected spikes—ideal for mission-critical workloads with strict SLA requirements.&lt;/p&gt; 
&lt;p&gt;The &lt;strong&gt;Minimal Base + Automatic Scaling approach&lt;/strong&gt; (the cost-focused path) maintains minimal base workers at current levels and relies heavily on automatic scaling, accepting 3-5 minute delays during peak periods and SLA breach risks in exchange for lower baseline costs, though this requires intensive monitoring and carries explicit warnings about high SLA risk.&lt;/p&gt; 
&lt;p&gt;The &lt;strong&gt;Hybrid Approach&lt;/strong&gt; (the balanced path) provisions base workers at 80% of the calculated requirement with automatic scaling covering the remaining 20%, resulting in 2-3 minute delays during spikes while balancing cost against performance—suitable for moderate SLA requirements with some budget constraints.&lt;/p&gt; 
&lt;p&gt;The comparison table contrasts the full, hybrid, and minimal strategies on three dimensions. Queue times are under 30 seconds, 2-3 minutes, and 3-5 minutes, respectively. SLA compliance is guaranteed, high probability, and at-risk during peak. The ideal use cases are mission-critical predictable workloads, moderate SLA requirements with budget constraints, and development environments with flexible SLA tolerance. Use the comparison to make provisioning decisions aligned with your operational requirements and financial constraints.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-1.jpeg"&gt;&lt;/p&gt; 
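&lt;p&gt;To make the three strategies concrete, the following Python sketch computes the base worker count each one implies. The strategy labels and function name are illustrative. Rounding the hybrid 80% share up and the current count of 8 workers (80 slots at 10 tasks per worker) are assumptions of this sketch.&lt;/p&gt;

```python
import math


def base_workers(strategy, required, current):
    """Return the base (minimum) worker count for a provisioning strategy."""
    if strategy == "full":
        return required                   # base equals the calculated requirement
    if strategy == "hybrid":
        return math.ceil(required * 0.8)  # 80% base, automatic scaling covers the rest
    if strategy == "minimal":
        return current                    # keep base workers at current levels
    raise ValueError(f"unknown strategy: {strategy}")


# 11 required workers comes from the scenario; 8 current workers is an assumption.
for name in ("full", "hybrid", "minimal"):
    print(name, base_workers(name, required=11, current=8))
```

&lt;p&gt;The gap between the base count and the calculated requirement is the capacity that automatic scaling must provide during peak hours, which is where the queue delays in the comparison originate.&lt;/p&gt;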
&lt;h2&gt;&lt;strong&gt;Key takeaway&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Effective capacity planning prevents both under-provisioning (SLA breaches) and over-provisioning (cost overruns).&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Capacity planning principles&lt;/strong&gt;&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Calculate capacity needs BEFORE adding workload&lt;/strong&gt; – Use peak task projections with 5-15% safety buffer&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Size minimum workers for peak demand&lt;/strong&gt; – Don’t rely on automatic scaling for predictable loads&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Use automatic scaling only for unexpected spikes&lt;/strong&gt; – Treat as safety net, not primary capacity&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Target 85-95% utilization during peak hours&lt;/strong&gt; – Ensures headroom for unexpected growth&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Plan 5-15% headroom for unexpected growth&lt;/strong&gt; – Production often differs from testing&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Monitor AdditionalWorkers metric&lt;/strong&gt; – If active &amp;gt;6 hours daily, increase base workers&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Review quarterly + trigger-based assessments&lt;/strong&gt; – Regular reviews plus immediate action on issues&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Balance cost and performance based on SLA criticality&lt;/strong&gt; – Business impact justifies infrastructure investment&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h3&gt;&lt;strong&gt;Success metrics&lt;/strong&gt;&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Queue efficiency:&lt;/strong&gt; Average queue time &amp;lt;30 seconds during peak&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;SLA compliance:&lt;/strong&gt; &amp;gt;99.5% of critical tasks complete on time&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Resource utilization:&lt;/strong&gt; 85-95% during peak hours (optimal efficiency)&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Cost predictability:&lt;/strong&gt; &amp;lt;10% variance in monthly worker costs&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Capacity planning is not a one-time exercise. It’s an ongoing discipline. The framework we’ve outlined gives you a repeatable process: measure your current peak utilization through CloudWatch metrics, project growth based on incoming workloads, calculate the required workers with an appropriate safety buffer, and monitor continuously to catch drift before it becomes an outage.&lt;/p&gt; 
&lt;p&gt;The financial services scenario in this post illustrates a common reality: running at 100% utilization during peak hours leaves zero room for the unexpected. By sizing to 95% peak utilization with a modest buffer, the team gained the headroom needed to absorb volatility without risking their 7 AM market-open SLA.&lt;/p&gt; 
&lt;p&gt;Whether you choose full base worker provisioning for mission-critical pipelines, a hybrid approach for moderate SLA requirements, or lean on automatic scaling for development workloads, the right strategy depends on your business context, not a one-size-fits-all rule. Pair your capacity plan with the CloudWatch alarms and review triggers we covered, and you’ll catch capacity gaps early.&lt;/p&gt; 
&lt;p&gt;Combined with the optimization-first approach from Part 1, you now have a complete toolkit: diagnose before you scale, optimize before you provision, and plan before you deploy. Your MWAA environment and your on-call engineers will thank you.&lt;/p&gt; 
&lt;p&gt;To get started, visit the &lt;a href="https://aws.amazon.com/managed-workflows-for-apache-airflow/" target="_blank" rel="noopener noreferrer"&gt;Amazon MWAA product page&lt;/a&gt; and the &lt;a href="https://console.aws.amazon.com/mwaa/home" target="_blank" rel="noopener noreferrer"&gt;Amazon MWAA console page&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;If you have questions or want to share your MWAA capacity planning, leave a comment.&lt;/p&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-2.jpeg" alt="Boyko Radulov" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Boyko Radulov&lt;/h3&gt; 
  &lt;p&gt;Boyko is a Senior Cloud Support Engineer at Amazon Web Services (AWS) and an Amazon MWAA and AWS Glue Subject Matter Expert. He works closely with customers to build and optimize their workloads on AWS while reducing the overall cost. Beyond work, he is passionate about sports and travelling.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-3.jpeg" alt="Kamen Sharlandjiev" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Kamen Sharlandjiev&lt;/h3&gt; 
  &lt;p&gt;Kamen is a Principal Big Data and ETL Solutions Architect and an Amazon MWAA and AWS Glue ETL expert. He’s on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on &lt;a href="https://www.linkedin.com/in/ksharlandjiev/" target="_blank" rel="noopener noreferrer"&gt;&lt;em&gt;LinkedIn&lt;/em&gt;&lt;/a&gt; to keep up to date with the latest Amazon MWAA and AWS Glue features and news.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-4.jpeg" alt="Venu Thangalapally" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Venu Thangalapally&lt;/h3&gt; 
  &lt;p&gt;Venu is a Senior Solutions Architect at AWS, based in Chicago, with deep expertise in cloud architecture, data and analytics, containers, and application modernization. He partners with financial service industry customers to translate business goals into secure, scalable, and compliant cloud solutions that deliver measurable value. Venu is passionate about using technology to drive innovation and operational excellence.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-5.jpeg" alt="Harshawardhan Kulkarni" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Harshawardhan Kulkarni&lt;/h3&gt; 
  &lt;p&gt;Harshawardhan is a Partner Technical Account Manager at AWS and an Amazon MWAA Subject Matter Expert. Based in Dublin, Ireland, he partners with enterprise customers across EMEA to help navigate complex workflows and orchestration challenges while ensuring best practice implementation. Outside of work, he enjoys traveling and spending time with his family.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-6.jpeg" alt="Andrew McKenzie" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Andrew McKenzie&lt;/h3&gt; 
  &lt;p&gt;Andrew is a Data Engineer and Educator who uses deep technical expertise from his time at AWS. As a former Amazon MWAA Subject Matter Expert, he now focuses on building data solutions and teaching data engineering best practices.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>A guide to Airflow worker pool optimization in Amazon MWAA</title>
		<link>https://aws.amazon.com/blogs/big-data/a-guide-to-airflow-worker-pool-optimization-in-amazon-mwaa/</link>
					
		
		<dc:creator><![CDATA[Boyko Radulov]]></dc:creator>
		<pubDate>Fri, 01 May 2026 15:41:26 +0000</pubDate>
				<category><![CDATA[Best Practices]]></category>
		<category><![CDATA[Amazon EMR]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[AWS Glue]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[EMR]]></category>
		<guid isPermaLink="false">5a370db59f8f679e52ac60136a4bcab3d33ca08d</guid>

					<description>Optimizing the Airflow worker pool configuration in Amazon Managed Workflows for Apache Airflow (Amazon MWAA), the AWS fully managed Apache Airflow service, is an important yet often overlooked strategy for scaling workflow operations. Tasks queued for longer periods can create the illusion that additional workers are the solution, when in reality the root cause might […]</description>
										<content:encoded>&lt;p&gt;Optimizing the Airflow worker pool configuration in &lt;a href="http://aws.amazon.com/managed-workflows-for-apache-airflow" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Workflows for Apache Airflow&lt;/a&gt; (Amazon MWAA), the AWS fully managed Apache Airflow service, is an important yet often overlooked strategy for scaling workflow operations. Tasks queued for longer periods can create the illusion that additional workers are the solution, when in reality the root cause might lie elsewhere. The decision to scale isn’t always straightforward. DevOps engineers and system administrators frequently face the challenge of determining whether adding more workers will solve their performance issues or only increase operational cost without addressing the root cause.&lt;/p&gt; 
&lt;p&gt;This post explores different patterns for worker scaling decisions in Amazon MWAA, focusing on the task pool mechanism and its relationship to worker allocation. By examining specific scenarios and providing a practical decision framework, this post helps you determine whether adding workers is the right solution for your performance challenges, and if so, how to implement this scaling effectively.&lt;/p&gt; 
&lt;h1&gt;Main patterns&lt;/h1&gt; 
&lt;p&gt;This section discusses the most frequently seen problems that raise the question of whether adding workers would improve the health of your environment.&lt;/p&gt; 
&lt;h2&gt;High CPU&lt;/h2&gt; 
&lt;p&gt;Airflow serves as a workflow management platform that coordinates and schedules tasks to be run on external processing services. It acts as a central orchestrator that can trigger and monitor tasks across various data processing systems like &lt;a href="https://aws.amazon.com/glue/" target="_blank" rel="noopener noreferrer"&gt;AWS Glue&lt;/a&gt;, &lt;a href="https://aws.amazon.com/batch/" target="_blank" rel="noopener noreferrer"&gt;AWS Batch&lt;/a&gt;, &lt;a href="https://aws.amazon.com/emr/" target="_blank" rel="noopener noreferrer"&gt;Amazon EMR&lt;/a&gt;, and other specialized data processing tools. Rather than processing data itself, Airflow’s strength lies in managing complex workflows and coordinating jobs between different systems and services.&lt;/p&gt; 
&lt;p&gt;In Analytics and Big Data environments, there is a prevalent misconception that saturated resources automatically warrant adding more capacity. However, for Amazon MWAA, understanding your workflow characteristics and optimization opportunities should precede scaling decisions.&lt;/p&gt; 
&lt;p&gt;As you scale up your workflows, resource utilization of the Airflow clusters naturally increases. When workers consistently operate at full capacity, it may seem intuitive to add additional compute resources. However, this approach often masks underlying inefficiencies rather than resolving them.&lt;/p&gt; 
&lt;p&gt;For example, if a single task consumes 100% of the available CPU on your Amazon MWAA worker, adding workers will not resolve the problem, because the task has been neither optimized nor split into smaller parts. Increasing the minimum number of workers will not have the expected effect and will only increase operating costs.&lt;/p&gt; 
&lt;p&gt;When your Amazon MWAA workers are consistently running above 90% CPU or memory utilization, you’ve reached a critical decision point. Before taking action, it is essential to understand the root cause. You have three primary options:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Scale horizontally by adding additional workers to distribute the load.&lt;/li&gt; 
 &lt;li&gt;Scale vertically by upgrading to a larger environment class for more resources per worker.&lt;/li&gt; 
 &lt;li&gt;Optimize your DAGs and scheduling patterns to be more efficient and consume fewer resources.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Each approach addresses different underlying issues, and choosing the right path depends on identifying whether you are facing a capacity constraint, resource-intensive task design, or workflow inefficiency. For guidance on optimization strategies, please refer to &lt;a href="https://docs.aws.amazon.com/mwaa/latest/userguide/best-practices-tuning.html" target="_blank" rel="noopener noreferrer"&gt;Performance tuning for Apache Airflow on Amazon MWAA&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;To monitor &lt;code&gt;CPUUtilization&lt;/code&gt; and &lt;code&gt;MemoryUtilization&lt;/code&gt; on the workers, follow the steps in &lt;a href="https://docs.aws.amazon.com/mwaa/latest/userguide/accessing-metrics-cw-container-queue-db.html#accessing-metrics-cw-container-queue-db-console" target="_blank" rel="noopener noreferrer"&gt;Accessing metrics in the Amazon CloudWatch console&lt;/a&gt;, choose the corresponding metrics, and then:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Select a time window long enough to show usage patterns.&lt;/li&gt; 
 &lt;li&gt;Set period to &lt;strong&gt;1 Minute&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Set statistics to &lt;strong&gt;Maximum&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Long queue time&lt;/h2&gt; 
&lt;p&gt;Sometimes Airflow tasks are stuck in a queued state for a long time, which prevents DAGs from completing on time.&lt;/p&gt; 
&lt;p&gt;In Amazon MWAA, each environment class comes with configured minimum and maximum worker nodes. Each worker provides a pre-configured concurrency, which is the number of tasks that can run simultaneously on that worker at any given time. The behavior is controlled through &lt;code&gt;celery.worker_autoscale=(max,min)&lt;/code&gt;.&lt;/p&gt; 
&lt;p&gt;For example, if you have a minimum of 4 mw1.small workers with the default Airflow configuration, you can run 20 concurrent tasks (4 workers x 5 max_tasks_per_worker). If your system suddenly needs to execute more than 20 tasks concurrently, an autoscaling event occurs. Amazon MWAA decides how to scale your workers efficiently and triggers the process. However, provisioning new workers takes time, and the additional tasks remain in the queued state in the meantime. To mitigate this queuing issue, consider the following:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;If the CPU utilization on the workers is low, increasing the &lt;code&gt;max&lt;/code&gt; value in &lt;code&gt;celery.worker_autoscale=(max,min)&lt;/code&gt; can reduce the time tasks stay in the queued state, because each worker can process more tasks concurrently. Note that an Airflow worker accepts tasks up to the defined task concurrency regardless of its available system resources. As a result, the base worker may reach 100% CPU or memory utilization before autoscaling takes effect.&lt;/li&gt; 
 &lt;li&gt;If you do not want to increase the task concurrency on the workers, increasing the minimum worker count can also be beneficial because having more available workers allows a higher number of tasks to run concurrently.&lt;/li&gt; 
&lt;/ol&gt; 
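&lt;p&gt;Both mitigation options raise the same quantity: the number of tasks that can run before any task has to queue. The following Python sketch compares them. The baseline of 4 workers and 5 tasks per worker comes from the example above. The values of 8 tasks per worker and 6 workers are hypothetical.&lt;/p&gt;

```python
def concurrent_task_capacity(workers, tasks_per_worker):
    """Tasks that can run at once before any task must wait in the queue."""
    return workers * tasks_per_worker


baseline = concurrent_task_capacity(4, 5)          # 4 mw1.small workers, default concurrency
more_concurrency = concurrent_task_capacity(4, 8)  # option 1: hypothetical higher per-worker limit
more_workers = concurrent_task_capacity(6, 5)      # option 2: hypothetical higher minimum worker count
print(baseline, more_concurrency, more_workers)
```

&lt;p&gt;The first option adds capacity without adding workers, but it places more tasks on the same CPU and memory. The second option adds capacity along with the compute to back it.&lt;/p&gt;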
&lt;h2&gt;Scheduling delays&lt;/h2&gt; 
&lt;p&gt;Adding new DAGs can not only affect your system resources, but it can also create uneven scheduling patterns. Some DAGs may experience delayed execution because of resource competition, even when the overall environment metrics appear healthy. This scheduling skew often manifests as inconsistent task pickup times, where certain workflows consistently wait longer in the queue while others execute promptly.&lt;/p&gt; 
&lt;p&gt;When &lt;a href="https://docs.aws.amazon.com/mwaa/latest/userguide/accessing-metrics-cw-container-queue-db.html" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch metrics&lt;/a&gt; show increasing variance in task scheduling times, particularly during periods of high DAG activity, it signals the need for environment optimization. This scenario requires careful analysis of execution patterns and resource utilization to determine which of the following applies:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Adding workers would distribute the workload. This solution is most effective when high utilization is caused primarily by task execution load rather than DAG parsing or scheduling overhead. Adding more minimum workers allows you to execute more tasks in parallel. For example, if the value of &lt;code&gt;AWS/MWAA/ApproximateAgeOfOldestTask&lt;/code&gt; is steadily increasing, the workers are not consuming messages from the queue fast enough. You can also monitor &lt;code&gt;AWS/MWAA/QueuedTasks&lt;/code&gt; to identify similar patterns.&lt;/li&gt; 
 &lt;li&gt;Upgrading the environment class would provide better scheduling capacity. If the Scheduler is showing signs of strain or if you’re seeing high resource utilization across all components, upgrading to a larger environment class might be the most appropriate solution. This provides more resources to both the Scheduler and Workers, allowing for better handling of increased DAG complexity and volume. To validate this, use &lt;code&gt;AWS/MWAA/CPUUtilization&lt;/code&gt; and &lt;code&gt;AWS/MWAA/MemoryUtilization&lt;/code&gt; in the Cluster metrics and choose the &lt;code&gt;Scheduler&lt;/code&gt;, &lt;code&gt;BaseWorker&lt;/code&gt;, and &lt;code&gt;AdditionalWorker&lt;/code&gt; metrics.&lt;/li&gt; 
 &lt;li&gt;Restructuring DAG schedules would reduce resource contention.&lt;/li&gt; 
&lt;/ol&gt; 
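&lt;p&gt;One simple way to test whether &lt;code&gt;ApproximateAgeOfOldestTask&lt;/code&gt; is steadily increasing is to check that each sample is higher than the previous one. The following Python sketch shows this check. The function name, the minimum of three samples, and the sample values are assumptions, and a production check would likely tolerate some noise.&lt;/p&gt;

```python
def is_steadily_increasing(values, min_points=3):
    """Return True when every sample is higher than the one before it."""
    if min_points > len(values):
        return False  # too few samples to call it a trend
    return all(later > earlier for earlier, later in zip(values, values[1:]))


# Hypothetical ApproximateAgeOfOldestTask samples, in seconds.
print(is_steadily_increasing([30, 45, 90, 160]))  # workers are falling behind
print(is_steadily_increasing([30, 45, 20, 25]))   # the queue recovers
```

&lt;p&gt;A strictly rising series during peak hours points to insufficient worker capacity, whereas a series that rises and then recovers is more consistent with a short burst that automatic scaling absorbed.&lt;/p&gt;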
&lt;p&gt;The key is to understand your workflow patterns and identify whether the scheduling delays are because of insufficient worker capacity or other environmental constraints.&lt;/p&gt; 
&lt;h1&gt;Anti-patterns&lt;/h1&gt; 
&lt;p&gt;This section showcases the most common anti-patterns that lead Amazon MWAA users to believe adding more workers will improve performance.&lt;/p&gt; 
&lt;h2&gt;Underutilized workers&lt;/h2&gt; 
&lt;p&gt;When evaluating Amazon MWAA performance bottlenecks, it’s important to distinguish between resource constraints and DAG design inefficiencies before scaling the environment.&lt;/p&gt; 
&lt;p&gt;Sometimes the Amazon MWAA environment has the capacity to run 100 tasks concurrently but your queue metrics (&lt;code&gt;AWS/MWAA/RunningTasks&lt;/code&gt;) show only 20 tasks active most of the time with no tasks remaining in queued state. In such scenarios, you are advised to check &lt;a href="https://docs.aws.amazon.com/mwaa/latest/userguide/accessing-metrics-cw-container-queue-db.html#accessing-metrics-cw-container-queue-db-list" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch&lt;/a&gt; for consistently low CPU and memory usage on existing workers during peak workload times. If this is confirmed, it is usually an indication of inefficiencies in DAG design, scheduling patterns, or Airflow configuration.&lt;/p&gt; 
&lt;p&gt;You have two primary options to address this:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Downsize&lt;/strong&gt;: If you do not expect your workload to increase, it is safe to assume you have over-provisioned your cluster. Start by removing any extra workers, and only then consider downsizing your environment class.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Optimize&lt;/strong&gt;: Fine-tune your DAG scheduling and your Airflow concurrency settings, including Pools, to increase the throughput of your system.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Misconfigured Airflow configurations that create artificial bottlenecks&lt;/h2&gt; 
&lt;p&gt;In Apache Airflow, performance bottlenecks often occur because of configuration settings, not actual resource constraints. At such times, DAG executions get delayed not because of insufficient compute, but because of incorrect concurrency configuration.&lt;/p&gt; 
&lt;p&gt;Efficient use of Amazon MWAA requires reviewing not only resource utilization for workers and schedulers but also concurrency configurations that can create artificial bottlenecks. Sometimes one restrictive configuration cancels out the scaling benefits of a larger environment or additional workers. Always audit Airflow configurations if performance seems limited even when system metrics suggest spare capacity.&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Important consideration&lt;/strong&gt;: Amazon Managed Workflows for Apache Airflow (Amazon MWAA) does not automatically update the worker concurrency configuration when you change the environment class. This behavior is important to understand when scaling your environment. Suppose you initially create an mw1.small environment, where each worker can handle up to 5 concurrent tasks by default. When you upgrade to a medium environment class (which supports 10 concurrent tasks per worker by default), the concurrency setting &lt;strong&gt;remains at 5&lt;/strong&gt; for in-place updated environments. You must manually update the concurrency configuration to take full advantage of the increased capacity available in the medium environment class.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;Because of this, you also need to update the Airflow configurations that control concurrency whenever you change the environment class. After upgrading your environment class, modify the &lt;code&gt;celery.worker_autoscale&lt;/code&gt; configuration in your Apache Airflow configuration options. This helps make sure your workers can process the maximum number of concurrent tasks supported by your new environment class.&lt;/p&gt; 
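&lt;p&gt;As a rough illustration of that update, the following sketch builds the arguments for an environment update. It is an assumption-laden example, not a definitive procedure: the helper name and environment name are hypothetical, the per-worker numbers are only the two quoted above, and using the same value for the maximum and minimum is a placeholder choice that you should check against the defaults for your Airflow version.&lt;/p&gt; 

```python
# Per-worker task defaults quoted in the text above; other classes are omitted.
DEFAULT_TASKS_PER_WORKER = {"mw1.small": 5, "mw1.medium": 10}


def autoscale_update(env_name, env_class, min_tasks=None):
    """Build keyword arguments for an environment update (illustrative helper)."""
    max_tasks = DEFAULT_TASKS_PER_WORKER[env_class]
    # Celery's autoscale value is "max_concurrency,min_concurrency".
    # Using the same number for both is an assumption, not a recommendation.
    floor = max_tasks if min_tasks is None else min_tasks
    return {
        "Name": env_name,
        "EnvironmentClass": env_class,
        "AirflowConfigurationOptions": {
            "celery.worker_autoscale": f"{max_tasks},{floor}",
        },
    }


kwargs = autoscale_update("my-environment", "mw1.medium")
print(kwargs["AirflowConfigurationOptions"])
# With boto3 installed and credentials configured, the call would look like:
#   boto3.client("mwaa").update_environment(**kwargs)
```

&lt;p&gt;Before sending a map like this, confirm how the update treats configuration options you have already set, so that a partial map does not drop them.&lt;/p&gt; 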
&lt;p&gt;Other times, an Amazon MWAA environment can be constrained by &lt;code&gt;max_active_runs&lt;/code&gt; or DAG concurrency controls instead of actual resource limits. These configuration-based throttles prevent tasks from running, even when the worker instances have available compute to handle the workload.&lt;/p&gt; 
&lt;p&gt;There is an important distinction between the two. Configuration limits act as artificial caps on parallelism, while true resource limits indicate that workers are fully utilizing their CPU or memory capacity. Understanding which type of constraint affects your environment helps you determine whether to adjust configuration settings or scale your infrastructure.&lt;/p&gt; 
&lt;p&gt;Adjusting Airflow configurations such as pools, concurrency, and max_active_runs can often solve performance problems without scaling workers. The following configurations control this behavior:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;max_active_runs&lt;/strong&gt; (DAG level; the environment-level default is max_active_runs_per_dag): Controls how many runs of a given DAG are allowed at the same time. If set to 2, only 2 DAG runs can run concurrently, even if there is plenty of worker capacity left. Extra runs wait in the queue, which slows DAG executions even though workers are idle.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;max_active_tasks:&lt;/strong&gt; This setting (the concurrency field in older DAG definitions, or max_active_tasks_per_dag at the environment level) limits the number of tasks from the DAG running at any moment, regardless of overall system capacity or number of workers.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Pools:&lt;/strong&gt; Pools restrict how many tasks of a certain type (often resource heavy) can run at once. A pool with only 3 slots queues any tasks beyond the first 3 assigned to that pool, even when workers are idle.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Execution timeouts and retries:&lt;/strong&gt; If these are not tuned, stuck tasks can hold worker slots and repeatedly retried failures can occupy slots unnecessarily, which slows queue processing.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Scheduling intervals and dependencies:&lt;/strong&gt; Overlapping or inefficient scheduling may cause idle periods or excess contention for resources, affecting real throughput.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;strong&gt;How Airflow configurations can override each other&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Airflow has multiple layers of concurrency and scheduling controls: some at the environment level, some at the DAG or task level, and others for pools. Sometimes more restrictive settings override more permissive ones, resulting in unexpected queue buildup.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;DAG level vs environment level:&lt;/strong&gt; If “max_active_runs” (DAG level) is lower than the environment-level “max_active_runs_per_dag” default or the system-wide concurrency, the DAG setting is used, throttling tasks even if the environment could do more.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Task level overrides:&lt;/strong&gt; Individual task definitions can have their own parameters, such as “max_active_tis_per_dag”, which caps how many instances of that task run at once and creates a bottleneck if set lower than the global settings.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Order of precedence:&lt;/strong&gt; The most restrictive relevant configuration at any level (Environment, DAG, Task) effectively sets the upper bound for parallel task execution.&lt;/p&gt; 
&lt;table border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Setting Location&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Setting&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Effect on task throughput&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Environment Level&lt;/td&gt; 
   &lt;td&gt;parallelism&lt;/td&gt; 
   &lt;td&gt;Max total tasks running on Scheduler&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;DAG Level&lt;/td&gt; 
   &lt;td&gt;max_active_runs&lt;/td&gt; 
   &lt;td&gt;Max simultaneous DAG runs&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;DAG Level&lt;/td&gt; 
   &lt;td&gt;max_active_tasks (formerly concurrency)&lt;/td&gt; 
   &lt;td&gt;Max concurrent tasks for that DAG&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
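&lt;p&gt;The “most restrictive setting wins” rule can be sketched as a simple minimum over the limits that apply to one DAG. This is a simplified model for reasoning about your own numbers, not how the Airflow scheduler is implemented: the function name is hypothetical, and it assumes every DAG run has the same number of tasks that can run in parallel.&lt;/p&gt; 

```python
def effective_task_limit(worker_slots, parallelism, max_active_tasks,
                         max_active_runs, tasks_per_run, pool_slots=None):
    """Return (limit, binding_setting) for one DAG's concurrent tasks."""
    limits = {
        "worker slots": worker_slots,
        "parallelism": parallelism,
        "max_active_tasks": max_active_tasks,
        # Each run can only contribute the tasks that are runnable in parallel.
        "max_active_runs": max_active_runs * tasks_per_run,
    }
    if pool_slots is not None:
        limits["pool"] = pool_slots
    # The smallest limit is the one that actually caps throughput.
    binding = min(limits, key=limits.get)
    return limits[binding], binding


# 100 worker slots, yet a DAG capped at 2 runs of 4 parallel tasks each.
limit, reason = effective_task_limit(
    worker_slots=100, parallelism=150, max_active_tasks=16,
    max_active_runs=2, tasks_per_run=4, pool_slots=32,
)
print(limit, reason)
```

&lt;p&gt;In this example the DAG never runs more than 8 tasks at once and the binding setting is max_active_runs, so adding workers would change nothing until that value is raised.&lt;/p&gt; 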
&lt;p&gt;Performance issues often resemble resource exhaustion but actually stem from overly restrictive configurations. Audit all the preceding parameters carefully. You can loosen restrictive values step by step and monitor their effect before deciding to scale your environment further. This approach helps you use your cloud resources cost-efficiently without paying for idle capacity.&lt;/p&gt; 
&lt;h2&gt;Slow resource depletion from memory leaks&lt;/h2&gt; 
&lt;p&gt;A common sign of a memory leak or slow resource depletion in Amazon MWAA is DAGs and tasks that begin to fail or slow down over time. In this case, scaling workers or increasing the environment size does not resolve the underlying issue, because the root cause is not a lack of capacity but an application-level leak that causes persistent exhaustion.&lt;/p&gt; 
&lt;p&gt;For example, as Airflow continuously runs tasks and parses DAGs over time, memory consumption can steadily increase across the environment. This might manifest as an Amazon MWAA metadata database experiencing declining FreeableMemory metrics despite consistent or even reduced workloads. When this occurs, database query performance gradually declines as memory becomes constrained for the scheduler, workers, and metadata database, ultimately affecting overall environment responsiveness because Airflow depends heavily on its metadata database for critical operations. This scenario is similar to how an application might create database connections without properly closing them, leading to resource exhaustion over time.&lt;/p&gt; 
&lt;h3&gt;Graph: Declining FreeableMemory and MemoryUtilization&lt;/h3&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/30/graph-freeablememory-memoryutilization-2026-04-30.png"&gt;&lt;/p&gt; 
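&lt;p&gt;The pattern in the graph, memory falling while the workload stays flat, can be checked numerically. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the inputs are equally spaced samples that you would export from your monitoring, and a plain least-squares slope stands in for whatever trend analysis you prefer.&lt;/p&gt; 

```python
def slope_per_sample(values):
    """Least-squares slope of values against their sample index."""
    n = len(values)
    if 2 > n:
        raise ValueError("need at least two samples")
    x_mean = (n - 1) / 2
    y_mean = sum(values) / n
    num = sum((i - x_mean) * (v - y_mean) for i, v in enumerate(values))
    den = sum((i - x_mean) ** 2 for i in range(n))
    return num / den


def looks_like_leak(freeable_mb, task_counts):
    """Flag falling free memory while the workload is flat or shrinking."""
    memory_falling = 0 > slope_per_sample(freeable_mb)
    workload_not_growing = 0 >= slope_per_sample(task_counts)
    return memory_falling and workload_not_growing


# Free memory drops every sample although the task count never changes.
print(looks_like_leak([4000, 3800, 3650, 3400, 3200], [50, 50, 50, 50, 50]))
```

&lt;p&gt;A True result only says the trend is suspicious. Falling memory alongside a growing workload is expected behavior, which is why the sketch checks both series.&lt;/p&gt; 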
&lt;p&gt;&lt;strong&gt;Common causes:&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Connection pool exhaustion:&lt;/strong&gt; DAGs that fail to properly close database connections can lead to connection pool exhaustion and memory leaks in the database.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Resource-intensive operations:&lt;/strong&gt; Complex, long-running queries or XCom operations against the metadata database can consume excessive memory.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Inefficient DAG design:&lt;/strong&gt; DAGs with numerous top-level Python calls can trigger database queries during DAG parsing. For instance, calling Variable.get() at the DAG level rather than at the task level creates unnecessary database load, because the call runs every time the file is parsed.&lt;/li&gt; 
&lt;/ol&gt; 
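&lt;p&gt;To see why top-level lookups are costly, the following sketch counts database round trips for the two styles. It does not import Airflow: the CountingVariables class is a hypothetical stand-in for a variable store backed by the metadata database, and the parse and run counts are made-up numbers chosen only to show the difference in scale.&lt;/p&gt; 

```python
class CountingVariables:
    """Stand-in for a metadata-database-backed variable store."""

    def __init__(self, data):
        self._data = dict(data)
        self.queries = 0

    def get(self, key):
        self.queries += 1  # every lookup is one database round trip
        return self._data[key]


def parse_with_top_level_lookup(store):
    """Mimics a DAG file that reads a variable while it is being parsed."""
    bucket = store.get("bucket")          # runs on every parse
    return lambda: f"s3://{bucket}/in"    # the task body


def parse_with_deferred_lookup(store):
    """Mimics a DAG file that reads the variable only inside the task."""
    return lambda: f"s3://{store.get('bucket')}/in"


def queries_after(parse, parses, task_runs):
    """Parse the 'file' repeatedly, run the task a few times, count lookups."""
    store = CountingVariables({"bucket": "example-bucket"})
    for _ in range(parses):
        task = parse(store)
    for _ in range(task_runs):
        task()
    return store.queries


# 1,000 parses of the file but only 2 task executions.
print(queries_after(parse_with_top_level_lookup, 1000, 2))
print(queries_after(parse_with_deferred_lookup, 1000, 2))
```

&lt;p&gt;The top-level style issues one query per parse (1,000 here), while the deferred style issues one query per task run (2 here). Because DAG files are parsed far more often than tasks run, the gap grows with every DAG that follows the top-level pattern.&lt;/p&gt; 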
&lt;p&gt;&lt;strong&gt;Recommended solutions:&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Implement Amazon CloudWatch monitoring:&lt;/strong&gt; Establish Amazon CloudWatch alarms for FreeableMemory with appropriate thresholds to detect issues early.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Regular database maintenance:&lt;/strong&gt; Perform scheduled database clean-up operations to purge historical data that is no longer needed.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Optimize DAG code:&lt;/strong&gt; Refactor DAGs to move database operations like Variable.get() from the DAG level to the task level to reduce parsing overhead.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Connection management:&lt;/strong&gt; Make sure all database connections are properly closed after use to prevent connection pool exhaustion.&lt;/li&gt; 
&lt;/ol&gt; 
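&lt;p&gt;For the connection management recommendation, one common Python pattern is to wrap the connection in a context manager so that it is closed even when a query fails. The following is a minimal sketch that uses the standard library sqlite3 module as a stand-in for whichever database driver your tasks use; the fetch_one helper is a hypothetical name.&lt;/p&gt; 

```python
import sqlite3
from contextlib import closing


def fetch_one(connect, sql, params=()):
    """Run one query and always release the cursor and the connection."""
    # closing() calls close() on exit, even if execute() raises.
    with closing(connect()) as conn:
        with closing(conn.cursor()) as cur:
            cur.execute(sql, params)
            return cur.fetchone()


# An in-memory database keeps the example self-contained.
print(fetch_one(lambda: sqlite3.connect(":memory:"), "SELECT 1 + 1"))
```

&lt;p&gt;Passing a connection factory rather than an open connection keeps ownership inside the helper: whoever opens the connection is also responsible for closing it.&lt;/p&gt; 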
&lt;p&gt;By following the preceding recommendations, you can keep memory utilization healthy for the metadata database and maintain the performance of your Amazon MWAA environment without needing to scale workers.&lt;/p&gt; 
&lt;h1&gt;Conclusion&lt;/h1&gt; 
&lt;p&gt;The decision to add workers in Amazon MWAA environments requires careful consideration of multiple factors beyond simple task queue metrics. In this post, we showed that while adding workers can address certain performance challenges, it’s often not the optimal first response to system bottlenecks.&lt;/p&gt; 
&lt;p&gt;Key considerations before scaling workers include:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Root cause analysis 
  &lt;ul&gt; 
   &lt;li&gt;Verify whether high CPU/memory usage stems from task optimization issues.&lt;/li&gt; 
   &lt;li&gt;Examine if queuing problems result from configuration constraints rather than resource limitations.&lt;/li&gt; 
   &lt;li&gt;Investigate potential memory leaks or resource depletion patterns.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;Configuration optimization 
  &lt;ul&gt; 
   &lt;li&gt;Review and adjust Airflow parameters (concurrency settings, pools, timeouts).&lt;/li&gt; 
   &lt;li&gt;Understand the interaction between different configuration layers.&lt;/li&gt; 
   &lt;li&gt;Optimize DAG design and scheduling patterns.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;The most successful Amazon MWAA implementations follow a systematic approach: first optimizing existing resources and configurations, then scaling workers only when justified by data-driven capacity planning. This approach ensures cost-effective operations while maintaining reliable workflow performance.&lt;/p&gt; 
&lt;p&gt;Remember that worker scaling is only one tool in the Amazon MWAA optimization toolkit. Long-term success depends on building a comprehensive performance management strategy that combines proper monitoring, proactive capacity planning, and continuous optimization of your Airflow workflows.&lt;/p&gt; 
&lt;p&gt;In the next post, we discuss capacity planning and the steps to take before adding DAGs to your environment, so that you can plan for the additional load and make sure you have enough headroom.&lt;/p&gt; 
&lt;p&gt;To get started, visit the &lt;a href="https://aws.amazon.com/managed-workflows-for-apache-airflow/" target="_blank" rel="noopener noreferrer"&gt;Amazon MWAA product page&lt;/a&gt; and the &lt;a href="https://docs.aws.amazon.com/mwaa/latest/userguide/best-practices-tuning.html" target="_blank" rel="noopener noreferrer"&gt;Performance tuning for Apache Airflow on Amazon MWAA&lt;/a&gt; page.&lt;/p&gt; 
&lt;p&gt;If you have questions or want to share your MWAA scaling experiences, leave a comment below.&lt;/p&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-2.jpeg" alt="Boyko Radulov" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Boyko Radulov&lt;/h3&gt; 
  &lt;p&gt;Boyko is a Senior Cloud Support Engineer at Amazon Web Services (AWS) and an Amazon MWAA and AWS Glue Subject Matter Expert. He works closely with customers to build and optimize their workloads on AWS while reducing the overall cost. Beyond work, he is passionate about sports and travelling.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-3.jpeg" alt="Kamen Sharlandjiev" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Kamen Sharlandjiev&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/ksharlandjiev/" target="_blank" rel="noopener"&gt;Kamen&lt;/a&gt; is a Principal Big Data and ETL Solutions Architect, Amazon MWAA and AWS Glue ETL expert. He’s on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on &lt;a href="https://www.linkedin.com/in/ksharlandjiev/" target="_blank" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; to keep up to date with the latest Amazon MWAA and AWS Glue features and news.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-4.jpeg" alt="Venu Thangalapally" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Venu Thangalapally&lt;/h3&gt; 
  &lt;p&gt;Venu is a Senior Solutions Architect at AWS, based in Chicago, with deep expertise in cloud architecture, data and analytics, containers, and application modernization. He partners with financial service industry customers to translate business goals into secure, scalable, and compliant cloud solutions that deliver measurable value. Venu is passionate about using technology to drive innovation and operational excellence.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-5.jpeg" alt="Harshawardhan Kulkarni" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Harshawardhan Kulkarni&lt;/h3&gt; 
  &lt;p&gt;Harshawardhan is a Partner Technical Account Manager at AWS and an Amazon MWAA Subject Matter Expert. Based in Dublin, Ireland, he partners with Enterprise Customers across EMEA to help navigate complex workflows and orchestration challenges while ensuring best practice implementation. Outside of work, he enjoys traveling and spending time with his family.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-6.jpeg" alt="Andrew McKenzie" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Andrew McKenzie&lt;/h3&gt; 
  &lt;p&gt;Andrew is a Data Engineer and Educator who draws on deep technical expertise from his time at AWS. As a former Amazon MWAA Subject Matter Expert, he now focuses on building data solutions and teaching data engineering best practices.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Unified observability in Amazon OpenSearch Service: metrics, traces, and AI agent debugging in a single interface</title>
		<link>https://aws.amazon.com/blogs/big-data/unified-observability-in-amazon-opensearch-service-metrics-traces-and-ai-agent-debugging-in-a-single-interface/</link>
					
		
		<dc:creator><![CDATA[Muthu Pitchaimani]]></dc:creator>
		<pubDate>Tue, 28 Apr 2026 17:29:01 +0000</pubDate>
				<category><![CDATA[Amazon OpenSearch Service]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Launch]]></category>
		<guid isPermaLink="false">1402dbf1fe29f4a55e798bc7d879de3064ae09bf</guid>

					<description>Amazon OpenSearch Service now brings application monitoring, native Amazon Managed Service for Prometheus integration, and AI agent tracing together in OpenSearch UI's observability workspace. In this post, we walk through two real-world scenarios using the OpenTelemetry sample app: a multi-agent travel planner facing slow processing, and a checkout flow quietly failing on one microservice.</description>
										<content:encoded>&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/what-is.html" target="_blank" rel="noopener noreferrer"&gt;Amazon OpenSearch Service&lt;/a&gt; now brings application monitoring, native &lt;a href="https://docs.aws.amazon.com/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Service for Prometheus&lt;/a&gt; integration, and AI agent tracing together in &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/application.html" target="_blank" rel="noopener noreferrer"&gt;OpenSearch UI&lt;/a&gt;’s observability workspace. You can query Prometheus metrics with &lt;a href="https://prometheus.io/docs/prometheus/latest/querying/basics/" target="_blank" rel="noopener noreferrer"&gt;PromQL&lt;/a&gt; alongside logs and traces stored in Amazon OpenSearch Service, trace an AI agent’s full reasoning chain down to the failing tool call, and drill from a service-level health view to the exact span that caused a checkout failure, all without leaving the interface.&lt;/p&gt; 
&lt;p&gt;In this post, we walk through two real-world scenarios using the OpenTelemetry sample app: a multi-agent travel planner facing slow processing, and a checkout flow quietly failing on one microservice. We chase each one to its root cause using these new capabilities.&lt;/p&gt; 
&lt;h2&gt;Scenario 1: An underperforming AI agent&lt;/h2&gt; 
&lt;p&gt;Your multi-agent travel planner is live and users start reporting slow responses. With the new AI agent tracing capability in Amazon OpenSearch Service, you can trace the agent’s full processing path to pinpoint exactly where things went wrong.&lt;/p&gt; 
&lt;p&gt;In any observability workspace in OpenSearch UI, navigate to &lt;strong&gt;Application Map&lt;/strong&gt; in the left navigation pane.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90438" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image003.jpg" alt="OpenSearch Service application map" width="2258" height="1520"&gt;&lt;/p&gt; 
&lt;p&gt;You can see the full topology of your system including the travel agent and the sub-agents it calls. The travel agent node shows elevated latency and occasional errors. Select it, and the side panel confirms that latency is up but the latency chart shows intermittent spikes rather than consistent degradation.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90439" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image005-scaled.jpg" alt="System topology with service health metrics" width="2560" height="1302"&gt;&lt;/p&gt; 
&lt;p&gt;The application map tells you something is wrong, but understanding &lt;em&gt;why&lt;/em&gt; an AI agent is underperforming requires seeing its reasoning chain. Select &lt;strong&gt;Agent Traces&lt;/strong&gt; in the left navigation pane, then filter by service name and time range.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90440" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image007.png" alt="Agent processing steps with invocation data" width="1430" height="728"&gt;&lt;/p&gt; 
&lt;p&gt;Select one of the traces to see the trace tree. Unlike a traditional span waterfall, this view organizes around the agent’s reasoning chain: the root agent span, the LLM calls it made, the tools it invoked, and how they nested, with each step color-coded by type. The trace map provides a visual directed graph of the same execution. You can see which model was called, how many input and output tokens were consumed, and the actual messages sent to and received from the model.&lt;/p&gt; 
&lt;p&gt;A tool call inside the weather agent errored out. The agent then spent additional time reasoning about the failure before returning a partial response, which explains the intermittent latency spikes and occasional faults.&lt;/p&gt; 
&lt;h3&gt;Why this matters for AI agents&lt;/h3&gt; 
&lt;p&gt;Agents make autonomous decisions based on LLM responses, tool results, and chained reasoning. Unlike traditional microservices with deterministic code paths, agent behavior varies across executions. Without semantic tracing that captures these AI-specific signals, root-cause analysis is guesswork. The trace tree surfaced the model name, token counts, and failing tool call because the travel planner was instrumented with OpenTelemetry’s generative AI semantic conventions. The next section describes how.&lt;/p&gt; 
&lt;h3&gt;Instrumenting AI agents&lt;/h3&gt; 
&lt;p&gt;OpenTelemetry auto-instrumentation enriches spans with well-known attributes for HTTP, database, and gRPC calls. AI agents need a different set of attributes (which LLM was called, how many tokens were consumed, which tools were invoked) that standard instrumentation doesn’t cover.&lt;/p&gt; 
&lt;p&gt;The &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" target="_blank" rel="noopener"&gt;OpenTelemetry gen_ai semantic conventions&lt;/a&gt; define standard attributes for these signals, including &lt;code&gt;gen_ai.operation.name&lt;/code&gt;, &lt;code&gt;gen_ai.usage.input_tokens&lt;/code&gt;, &lt;code&gt;gen_ai.request.model&lt;/code&gt;, and &lt;code&gt;gen_ai.tool.name&lt;/code&gt;. When Amazon OpenSearch Service receives spans with these attributes, it categorizes them by operation type (agent, LLM, tool, embeddings, retrieval) and renders the agent trace tree and trace map views.&lt;/p&gt; 
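&lt;p&gt;To illustrate the idea of grouping spans by operation type, here is a small sketch that buckets span attribute dictionaries by their &lt;code&gt;gen_ai.operation.name&lt;/code&gt; value. It is not how Amazon OpenSearch Service implements the feature: the mapping table, the function name, the sample attribute values, and the “other” fallback are assumptions for illustration, and retrieval is left out because this sketch does not assume an operation name for it.&lt;/p&gt; 

```python
# Hypothetical mapping from gen_ai.operation.name values to display groups.
# The group names follow the text above; the mapping itself is an assumption.
OPERATION_GROUPS = {
    "invoke_agent": "agent",
    "create_agent": "agent",
    "chat": "LLM",
    "text_completion": "LLM",
    "execute_tool": "tool",
    "embeddings": "embeddings",
}


def group_span(attributes):
    """Return the display group for one span's attribute dictionary."""
    operation = attributes.get("gen_ai.operation.name")
    return OPERATION_GROUPS.get(operation, "other")


spans = [
    {"gen_ai.operation.name": "invoke_agent"},
    {"gen_ai.operation.name": "chat", "gen_ai.request.model": "gpt-4o",
     "gen_ai.usage.input_tokens": 412},
    {"gen_ai.operation.name": "execute_tool", "gen_ai.tool.name": "get_weather"},
    {"http.request.method": "GET"},  # ordinary span without gen_ai attributes
]
print([group_span(s) for s in spans])
```

&lt;p&gt;Spans that carry no gen_ai attributes fall through to the fallback group, which mirrors why uninstrumented agent code does not appear in an agent-oriented view.&lt;/p&gt; 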
&lt;p&gt;The Python SDK provides one way to generate these spans. To send traces to Amazon OpenSearch Ingestion, configure the SDK with AWS Signature Version 4 (SigV4) authentication. The &lt;code&gt;AWSSigV4OTLPExporter&lt;/code&gt; cryptographically signs each HTTP request to help prevent unauthorized data ingestion. The calling identity needs an IAM policy that grants &lt;code&gt;osis:Ingest&lt;/code&gt; on your pipeline’s ARN. Credentials are resolved through the standard AWS credential provider chain.&lt;/p&gt; 
&lt;pre&gt;&lt;code class="language-python"&gt;from opensearch_genai_observability_sdk_py import register, AWSSigV4OTLPExporter

exporter = AWSSigV4OTLPExporter(
    endpoint="https://pipeline.us-east-1.osis.amazonaws.com/v1/traces",
    service="osis",
    region="us-east-1",
)

register(service_name="my-agent", exporter=exporter)
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Use the &lt;code&gt;@observe&lt;/code&gt; decorator to trace agent functions and &lt;code&gt;enrich()&lt;/code&gt; to add model metadata:&lt;/p&gt; 
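&lt;pre&gt;&lt;code class="language-python"&gt;# Assumption: these names are exported by the same package as register()
from opensearch_genai_observability_sdk_py import observe, enrich, Op

@observe(op=Op.EXECUTE_TOOL)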
def get_weather(city: str) -&amp;gt; dict:
    return {"city": city, "temp": 22, "condition": "sunny"}

@observe(op=Op.INVOKE_AGENT)
def assistant(query: str) -&amp;gt; str:
    enrich(model="gpt-4o", provider="openai")
    data = get_weather("Paris")
    return f"{data['condition']}, {data['temp']}C"

result = assistant("What's the weather?")
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;The SDK also supports auto-instrumentation for OpenAI, Anthropic, Amazon Bedrock, LangChain, LlamaIndex, and others. Because the instrumentation is built on OpenTelemetry standards, any agent framework that emits spans with &lt;code&gt;gen_ai.*&lt;/code&gt; attributes is compatible with OpenSearch UI.&lt;/p&gt; 
&lt;h2&gt;Scenario 2: Investigating a microservice issue&lt;/h2&gt; 
&lt;p&gt;AI agents are only one part of most production environments. The same interface surfaces telemetry from conventional microservices, where the troubleshooting workflow follows a more familiar path.&lt;/p&gt; 
&lt;p&gt;Your ecommerce checkout begins paging during a busy traffic window. From OpenSearch UI, navigate to &lt;strong&gt;APM Services&lt;/strong&gt; in the left navigation pane. Every instrumented service is listed alongside its health indicators. The checkout service shows an elevated error rate.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90441" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image009-scaled.jpg" alt="Service overview panel with request, error, duration metrics" width="2560" height="1306"&gt;&lt;/p&gt; 
&lt;p&gt;Select the affected service. The detail view shows Request, Error, and Duration (RED) metrics: request rate is climbing, fault rate has spiked in the last 15 minutes, and p99 duration has doubled. You can see exactly when the degradation started.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90442" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image011.png" alt="Service drilldown health dashboard" width="1431" height="723"&gt;&lt;/p&gt; 
&lt;p&gt;Drill into the correlated spans for the affected time window. The span list shows multiple failed requests, all hitting the same endpoint. Select one to see the full trace waterfall. The checkout service called &lt;code&gt;prepareOrder&lt;/code&gt;, which failed trying to retrieve a product from the catalog. The error message in the span details tells you exactly what went wrong: that’s your root cause.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-90443 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image013.png" alt="Waterfall transaction view of spans" width="1429" height="730"&gt;&lt;/p&gt; 
&lt;h3&gt;Checking the infrastructure with PromQL&lt;/h3&gt; 
&lt;p&gt;In both scenarios, the natural next question is whether the problem originates in the application or in the infrastructure beneath it. With the new Amazon Managed Service for Prometheus integration, you can answer that question without leaving OpenSearch UI.&lt;/p&gt; 
&lt;p&gt;Prometheus metrics are now queryable directly from the same workspace using native PromQL syntax, alongside the logs and traces you’ve already been navigating.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90444" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image015.png" alt="Metric query showing Prometheus Query Language" width="1431" height="820"&gt;&lt;/p&gt; 
&lt;p&gt;For the database timeout in Scenario 2, run a PromQL query to check the database instance’s read/write throughput for the same time window. For the agent latency issue in Scenario 1, check the LLM endpoint’s response time metrics to see if the slowness originates from the model provider.&lt;/p&gt; 
&lt;p&gt;This is a key architectural decision: metrics continue to live in Amazon Managed Service for Prometheus, logs and traces continue to live in Amazon OpenSearch Service, and neither signal is copied or warehoused into a second store. Each backend remains the single store for the data type it’s purpose-built to handle, while OpenSearch UI federates queries across both at runtime. The cost, retention, and operational model of each store stay intact while the troubleshooting workflow collapses into a single interface.&lt;/p&gt; 
&lt;p&gt;To configure the OpenTelemetry Collector and OpenSearch Ingestion pipelines that route metrics into Amazon Managed Service for Prometheus, see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/observability-ingestion.html" target="_blank" rel="noopener"&gt;Ingesting application telemetry&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;How it’s wired together&lt;/h2&gt; 
&lt;p&gt;The following diagram shows the end-to-end architecture. Applications instrumented with OpenTelemetry send traces, logs, and metrics over OTLP to Amazon OpenSearch Ingestion. OpenSearch Ingestion routes each signal to the appropriate store: traces and logs land in Amazon OpenSearch Service, while metrics flow into Amazon Managed Service for Prometheus. OpenSearch UI then queries both stores to render the Application Map, Services catalog, Agent Traces, and Metrics views.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90446" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image019.png" alt="OpenSearch Observability Stack Architecture" width="1202" height="472"&gt;&lt;/p&gt; 
&lt;p&gt;The entire experience rests on open-source foundations (Prometheus for metrics, OpenSearch for logs and traces, and OpenTelemetry for instrumentation). Teams already running an OpenTelemetry collector can adopt it by updating the collector’s export configuration to point at Amazon OpenSearch Ingestion, with no proprietary agents or rewritten instrumentation required.&lt;/p&gt; 
&lt;h2&gt;Getting started&lt;/h2&gt; 
&lt;p&gt;To enable these capabilities, log in to OpenSearch UI’s observability workspace, select the &lt;strong&gt;Gear&lt;/strong&gt; icon in the bottom left corner to open Settings and setup, and verify that the &lt;strong&gt;Observability:apmEnabled&lt;/strong&gt; toggle is on under the Observability section. OpenSearch UI is available at no additional charge for Amazon OpenSearch Service customers.&lt;/p&gt; 
&lt;div style="width: 640px;" class="wp-video"&gt;
 &lt;video class="wp-video-shortcode" id="video-90656-1" width="640" height="360" preload="metadata" controls="controls"&gt;
  &lt;source type="video/mp4" src="https://d2908q01vomqb2.cloudfront.net/artifacts/DBSBlogs/BDB-5856/BDB-5856.mp4?_=1"&gt;
 &lt;/video&gt;
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Explore locally first.&lt;/strong&gt; The &lt;a href="https://opensearch.org/platform/observability-stack/" target="_blank" rel="noopener"&gt;OpenSearch Observability Stack&lt;/a&gt; gives you a fully configured environment including application monitoring, agent tracing, and Prometheus integration, running on your machine with a single install command. It ships with sample instrumented services, including a multi-agent travel planner, so you can explore the full workflow with real telemetry data out of the box.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;For AI agent development.&lt;/strong&gt; &lt;a href="https://observability.opensearch.org/docs/agent-health/" target="_blank" rel="noopener"&gt;Agent Health&lt;/a&gt; is an open-source, evaluation-driven observability tool designed for local development. It gives you execution flow graphs, token tracking, and tool invocation visibility right in your development loop, before you push to production.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;For production.&lt;/strong&gt; The &lt;a href="https://observability.opensearch.org/docs/send-data/ai-agents/python/" target="_blank" rel="noopener"&gt;Python SDK&lt;/a&gt; provides one-line setup and decorator-based tracing with gen_ai semantic conventions, with auto-instrumentation support for OpenAI, Anthropic, Amazon Bedrock, LangChain, LlamaIndex, and others. See the &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/observability.html" target="_blank" rel="noopener"&gt;Amazon OpenSearch Service documentation&lt;/a&gt; and the &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/direct-query-prometheus-overview.html" target="_blank" rel="noopener"&gt;Amazon Managed Service for Prometheus integration guide&lt;/a&gt; for the full managed experience.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90447" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image021.png" alt="" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Muthu Pitchaimani&lt;/h3&gt; 
  &lt;p&gt;Muthu is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90450" style="font-size: 16px" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image022.png" alt="" width="100" height="102"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Raaga N.G&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/raaga-shree/" target="_blank" rel="noopener noreferrer"&gt;Raaga&lt;/a&gt; is a Solutions Architect at AWS with over 5 years of experience helping enterprises modernize their technology landscape and build scalable, cloud-native solutions. She partners with customers to translate business requirements into efficient cloud architectures that drive measurable outcomes, supporting their journey from application modernization to AI adoption through thoughtful, customer-centric solutions.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90448" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image023.png" alt="" width="1920" height="2560"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Rekha Thottan&lt;/h3&gt; 
  &lt;p&gt;Rekha Thottan is a Senior Technical Product Manager at AWS OpenSearch, contributing to AI agent observability and evaluation for the OpenSearch Project.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90449" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image025.png" alt="" width="576" height="768"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Kevin Lewin&lt;/h3&gt; 
  &lt;p&gt;Kevin is a Cloud Operations Specialist Solution Architect at Amazon Web Services. He focuses on helping customers achieve their operational goals through observability and automation.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		<enclosure length="30351156" type="video/mp4" url="https://d2908q01vomqb2.cloudfront.net/artifacts/DBSBlogs/BDB-5856/BDB-5856.mp4"/>

			</item>
		<item>
		<title>Migrate to Apache Flink 2.2 on Amazon Managed Service for Apache Flink</title>
		<link>https://aws.amazon.com/blogs/big-data/migrate-to-apache-flink-2-2-on-amazon-managed-service-for-apache-flink/</link>
					
		
		<dc:creator><![CDATA[Francisco Morillo]]></dc:creator>
		<pubDate>Mon, 27 Apr 2026 17:57:34 +0000</pubDate>
				<category><![CDATA[Amazon Managed Service for Apache Flink]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[AWS CloudFormation]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">85924922056cff871769bd6a673bf09a18f9e7a4</guid>

					<description>In this post, we explain what's new in Amazon Managed Service for Apache Flink 2.2, provide a guided migration using CLI commands, console instructions, and code examples, and show you how to monitor the upgrade and roll back if needed.</description>
										<content:encoded>&lt;p&gt;Migrating to &lt;a href="https://aws.amazon.com/managed-service-apache-flink/" target="_blank" rel="noopener noreferrer"&gt;Apache Flink 2.2&lt;/a&gt; on&amp;nbsp;&lt;a href="https://aws.amazon.com/managed-service-apache-flink/" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Service for Apache Flink&amp;nbsp;&lt;/a&gt;gives you access to Java 17 runtime, faster checkpoints and recovery through RocksDB 8.10.0, and SQL-native artificial intelligence and machine learning (AI/ML) inference. If you run Flink 1.x today, you might be dealing with an aging Java 11 runtime that will no longer receive standard support by the end of this year, slower state backend performance, and a fragmented API surface split across DataSet, DataStream, and legacy connector interfaces. Flink 2.2 addresses these gaps in a single major version upgrade.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://flink.apache.org/" target="_blank" rel="noopener noreferrer"&gt;Apache Flink&amp;nbsp;&lt;/a&gt;is an open source distributed processing engine for stream and batch data, with first-class support for stateful processing and event-time semantics. Amazon Managed Service for Apache Flink removes the operational overhead of running Flink. You provide your application code, and the service provisions, scales, checkpoints, and patches the infrastructure for you.&lt;/p&gt; 
&lt;p&gt;In this post, we explain what’s new in &lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/flink-2-2.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Service for Apache Flink 2.2&lt;/a&gt;, provide a guided migration using CLI commands, console instructions, and code examples, and show you how to monitor the upgrade and roll back if needed.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Before you upgrade:&lt;/strong&gt;&amp;nbsp;Flink 2.2 removes the DataSet API, drops Java 11 support, and replaces legacy connector interfaces. We recommend reviewing the&amp;nbsp;&lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/flink-2-2-upgrade-guide.html" target="_blank" rel="noopener noreferrer"&gt;Upgrading to Flink 2.2: Complete Guide&amp;nbsp;&lt;/a&gt;and the&amp;nbsp;&lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/state-compatibility.html" target="_blank" rel="noopener noreferrer"&gt;State Compatibility Guide for Flink 2.2 Upgrades&lt;/a&gt;&amp;nbsp;before upgrading production applications.&lt;/p&gt; 
&lt;h2&gt;What’s new in Amazon Managed Service for Apache Flink 2.2&lt;/h2&gt; 
&lt;p&gt;This release spans runtime upgrades, SQL, and Table API capabilities. The following sections break down each area.&lt;/p&gt; 
&lt;h3&gt;Runtime and performance&lt;/h3&gt; 
&lt;p&gt;These changes improve application performance and bring your runtime up to current standards.&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;&lt;a href="https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/deployment/java_compatibility/" target="_blank" rel="noopener noreferrer"&gt;Java 17 runtime&lt;/a&gt; –&lt;/strong&gt;&amp;nbsp;Flink 2.2 requires Java 17. Build your application code with JDK 17 for better garbage collection, a more secure runtime, and modern language features like sealed classes and records. Java 11 is no longer supported.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;&lt;a href="https://nightlies.apache.org/flink/flink-docs-release-2.2/docs/dev/python/overview/" target="_blank" rel="noopener noreferrer"&gt;Python 3.12&lt;/a&gt; –&lt;/strong&gt;&amp;nbsp;Flink 2.2 requires Python 3.9+, with Python 3.12 as the default. Python 3.8 is no longer supported.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;RocksDB 8.10.0 –&lt;/strong&gt;&amp;nbsp;Your stateful applications benefit from improved I/O performance with the upgraded state backend, resulting in faster checkpoints and recovery.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Dedicated collection serializers –&lt;/strong&gt;&amp;nbsp;Improved serializers for Map, List, and Set types reduce serialization overhead, which lowers checkpoint sizes for applications that use these data structures frequently.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Kryo 5.6 –&lt;/strong&gt;&amp;nbsp;Kryo upgrades from version 2.24 to 5.6. This has state compatibility implications covered in the migration section.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;SQL and Table API highlights&lt;/h3&gt; 
&lt;p&gt;With Flink 2.2, you can:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Call machine learning (ML) models directly from SQL using &lt;a href="https://nightlies.apache.org/flink/flink-docs-release-2.2/" target="_blank" rel="noopener noreferrer"&gt;ML_PREDICT&lt;/a&gt; and CREATE MODEL&lt;/li&gt; 
 &lt;li&gt;Work with semistructured data through the native &lt;a href="https://nightlies.apache.org/flink/flink-docs-master/docs/sql/reference/data-types/" target="_blank" rel="noopener noreferrer"&gt;VARIANT type&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;Build stateful event-driven logic in SQL with &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/table/functions/ptfs/" target="_blank" rel="noopener noreferrer"&gt;ProcessTableFunction&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;Run more efficient streaming joins with &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/api/java/org/apache/flink/table/runtime/operators/join/stream/StreamingMultiJoinOperator.html" target="_blank" rel="noopener noreferrer"&gt;StreamingMultiJoinOperator&lt;/a&gt; and &lt;a href="https://flink.apache.org/2025/12/04/apache-flink-2.2.0-advancing-real-time-data--ai-and-empowering-stream-processing-for-the-ai-era/#delta-join" target="_blank" rel="noopener noreferrer"&gt;Delta Join&lt;/a&gt;&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;For details on these features, see the&amp;nbsp;&lt;a href="https://flink.apache.org/2025/12/04/apache-flink-2.2.0-advancing-real-time-data--ai-and-empowering-stream-processing-for-the-ai-era/" target="_blank" rel="noopener noreferrer"&gt;Apache Flink 2.2 release documentation&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;Migrating from Flink 1.x to 2.2&lt;/h2&gt; 
&lt;h3&gt;In-place version upgrades&lt;/h3&gt; 
&lt;p&gt;You can upgrade a running Flink 1.x application to 2.2 using the &lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/how-in-place-version-upgrades.html" target="_blank" rel="noopener noreferrer"&gt;UpdateApplication API&lt;/a&gt;, the AWS Management Console, AWS CloudFormation, the AWS SDK, or Terraform. The upgrade preserves your application configuration, logs, metrics, and tags, and, if your state and binaries are compatible, your application state.&lt;/p&gt; 
&lt;h3&gt;Auto-rollback&lt;/h3&gt; 
&lt;p&gt;With &lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/troubleshooting-system-rollback.html" target="_blank" rel="noopener noreferrer"&gt;auto-rollback&lt;/a&gt; turned on, binary incompatibilities detected during job startup trigger an automatic revert to the previous Flink version within minutes, with no manual intervention required. For state incompatibilities that surface as restart loops after a successful upgrade, invoke the Rollback API to return to your previous version and state.&lt;/p&gt; 
&lt;h3&gt;Unsupported open source features&lt;/h3&gt; 
&lt;p&gt;The following Flink 2.2 features aren’t currently supported in Amazon Managed Service for Apache Flink because they’re still considered experimental: Materialized Tables, ForSt State Backend (disaggregated state storage), Java 21, and custom metric reporters/telemetry configurations. We continue to evaluate these features as they mature in the Apache Flink project and will share updates on availability. For the full list of supported features, see &lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/flink-2-2.html#flink-2-2-supported-features" target="_blank" rel="noopener noreferrer"&gt;Apache Flink 2.2 features supported&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;Now that you know what’s changed, the next section walks through the migration process.&lt;/p&gt; 
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;Before starting the migration, confirm that you have the following in place:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;An existing Apache Flink 1.x application running on Amazon Managed Service for Apache Flink.&lt;/li&gt; 
 &lt;li&gt;JDK 17 installed in your local build environment.&lt;/li&gt; 
 &lt;li&gt;The AWS Command Line Interface (AWS CLI) installed and configured with permissions to call the&amp;nbsp;kinesisanalyticsv2&amp;nbsp;APIs (UpdateApplication, CreateApplicationSnapshot, DescribeApplication, RollbackApplication).&lt;/li&gt; 
 &lt;li&gt;An Amazon Simple Storage Service (Amazon S3) bucket to upload your updated application JAR.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;We recommend testing each phase on a non-production replica of your application before applying the same steps to production.&lt;/p&gt; 
&lt;h3&gt;Step 1: Update your application code&lt;/h3&gt; 
&lt;p&gt;Start by updating your Flink dependencies to version 2.2.0 and replacing deprecated APIs. The following sections show the most common changes.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Update your pom.xml:&lt;/strong&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-html"&gt;&amp;lt;properties&amp;gt;
    &amp;lt;flink.version&amp;gt;2.2.0&amp;lt;/flink.version&amp;gt;
    &amp;lt;java.version&amp;gt;17&amp;lt;/java.version&amp;gt;
&amp;lt;/properties&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Replace legacy Kinesis connectors:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Flink 2.2 removes the&amp;nbsp;FlinkKinesisConsumer&amp;nbsp;and&amp;nbsp;FlinkKinesisProducer&amp;nbsp;classes. The following example shows how to migrate to the FLIP-27 based&amp;nbsp;KinesisStreamsSource.&lt;/p&gt; 
&lt;p&gt;Before (Flink 1.x):&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-java"&gt;FlinkKinesisConsumer&amp;lt;String&amp;gt; consumer = new FlinkKinesisConsumer&amp;lt;&amp;gt;(
    "my-stream",
    new SimpleStringSchema(),
    consumerConfig);
env.addSource(consumer);&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;After (Flink 2.2):&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-java"&gt;KinesisStreamsSource&amp;lt;String&amp;gt; source = KinesisStreamsSource.&amp;lt;String&amp;gt;builder()
    .setStreamArn("arn:aws:kinesis:us-east-1:123456789012:stream/my-stream")
    .setDeserializationSchema(new SimpleStringSchema())
    .build();
env.fromSource(source, WatermarkStrategy.noWatermarks(), "Kinesis Source");&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Update connector dependencies:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;The following AWS connectors have Flink 2.x-compatible releases:&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;thead&gt; 
  &lt;tr&gt; 
   &lt;th style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Connector&lt;/strong&gt;&lt;/th&gt; 
   &lt;th style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Flink 2.x Artifact&lt;/strong&gt;&lt;/th&gt; 
   &lt;th style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Version&lt;/strong&gt;&lt;/th&gt; 
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Apache Kafka&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;flink-connector-kafka&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;4.0.0-2.0&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Amazon Kinesis Data Streams&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;flink-connector-aws-kinesis-streams&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;6.0.0-2.0&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Amazon Data Firehose&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;flink-connector-aws-kinesis-firehose&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;6.0.0-2.0&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Amazon DynamoDB&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;flink-connector-dynamodb&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;6.0.0-2.0&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Amazon Simple Queue Service (Amazon SQS)&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;flink-connector-sqs&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;6.0.0-2.0&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;At the time of writing, the JDBC, OpenSearch, and Prometheus connectors don’t yet have Flink 2.x-compatible releases. For the latest versions, see the&amp;nbsp;&lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/how-flink-connectors.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Service for Apache Flink connector documentation&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;Beyond connector updates, make the following code changes:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Replace DataSet API usage with the DataStream API or Table API/SQL.&lt;/li&gt; 
 &lt;li&gt;Replace Scala API usage with the Java API.&lt;/li&gt; 
 &lt;li&gt;Verify that your build targets JDK 17.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Build your updated application JAR and upload it to Amazon S3 with a different file name than your current JAR (for example,&amp;nbsp;my-app-flink-2.2.jar).&lt;/p&gt; 
&lt;h3&gt;Step 2: Check state compatibility&lt;/h3&gt; 
&lt;p&gt;Before upgrading, assess whether your application state is compatible with Flink 2.2. The Kryo upgrade from version 2.24 to 5.6 changes the binary format of serialized state. Applications using POJOs with Java collections (HashMap, ArrayList, HashSet) are the most common source of incompatibility.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Quick compatibility check:&lt;/strong&gt;&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;thead&gt; 
  &lt;tr&gt; 
   &lt;th style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Serialization type&lt;/strong&gt;&lt;/th&gt; 
   &lt;th style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Compatible?&lt;/strong&gt;&lt;/th&gt; 
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Avro (SpecificRecord, GenericRecord)&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;img src="https://s.w.org/images/core/emoji/14.0.0/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;"&gt; Yes&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Protobuf&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;img src="https://s.w.org/images/core/emoji/14.0.0/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;"&gt; Yes&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;POJOs without collections&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;img src="https://s.w.org/images/core/emoji/14.0.0/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;"&gt; Yes&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Custom TypeSerializers (no Kryo delegation)&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;img src="https://s.w.org/images/core/emoji/14.0.0/72x72/2705.png" alt="✅" class="wp-smiley" style="height: 1em; max-height: 1em;"&gt; Yes&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;POJOs with Java collections&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;img src="https://s.w.org/images/core/emoji/14.0.0/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;"&gt; No&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Scala case classes&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;img src="https://s.w.org/images/core/emoji/14.0.0/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;"&gt; No&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Types using Kryo fallback&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;img src="https://s.w.org/images/core/emoji/14.0.0/72x72/274c.png" alt="❌" class="wp-smiley" style="height: 1em; max-height: 1em;"&gt; No&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;&lt;strong&gt;Check your logs for Kryo fallback:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Search your application logs for this pattern, which indicates that a type is falling back to Kryo serialization: &lt;code&gt;Class class &amp;lt;className&amp;gt; cannot be used as a POJO type&lt;/code&gt;&lt;/p&gt; 
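&lt;p&gt;As a rough sketch of that search, the following shell function lists each distinct class named in the fallback message. It assumes you have already exported the application logs to a local text file. The function name &lt;code&gt;kryo_fallback_classes&lt;/code&gt; and the file name in the usage line are hypothetical.&lt;/p&gt; 

```shell
#!/usr/bin/env bash
# List the distinct classes that Flink reported as falling back to Kryo.
# Assumption: the application logs were exported to a local text file first.
kryo_fallback_classes() {
  local log_file="$1"
  # Capture the class name between "Class class" and the fixed message text,
  # then de-duplicate so each class is printed once.
  sed -n 's/.*Class class \([^ ]*\) cannot be used as a POJO type.*/\1/p' "$log_file" | sort -u
}
```

&lt;p&gt;For example, &lt;code&gt;kryo_fallback_classes app.log&lt;/code&gt; prints one class name per line. Any class it lists is a candidate for the incompatible rows in the preceding table.&lt;/p&gt; 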
&lt;h3&gt;Step 3: Turn on auto-rollback and automatic snapshots&lt;/h3&gt; 
&lt;p&gt;Turn on auto-rollback so the service automatically reverts to the previous version if the upgrade fails. Also, verify that automatic snapshots are turned on. The service takes a snapshot before the upgrade that serves as your rollback point.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Check current settings:&lt;/strong&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-css"&gt;aws kinesisanalyticsv2 describe-application \
    --application-name MyApplication \
    --query 'ApplicationDetail.ApplicationConfigurationDescription.{
        AutoRollback: ApplicationSystemRollbackConfigurationDescription.RollbackEnabled,
        AutoSnapshots: ApplicationSnapshotConfigurationDescription.SnapshotsEnabled
    }'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Turn on both if they’re not already active:&lt;/strong&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;aws kinesisanalyticsv2 update-application \
    --application-name MyApplication \
    --current-application-version-id &amp;lt;version-id&amp;gt; \
    --application-configuration-update '{
        "ApplicationSystemRollbackConfigurationUpdate": {
            "RollbackEnabledUpdate": true
        },
        "ApplicationSnapshotConfigurationUpdate": {
            "SnapshotsEnabledUpdate": true
        }
    }'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
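&lt;p&gt;The &lt;code&gt;--current-application-version-id&lt;/code&gt; value in the preceding command has to match the application’s current version. One way to look it up is sketched in the following helper, which reads &lt;code&gt;ApplicationVersionId&lt;/code&gt; from &lt;code&gt;describe-application&lt;/code&gt;. The function name &lt;code&gt;current_version_id&lt;/code&gt; is hypothetical.&lt;/p&gt; 

```shell
#!/usr/bin/env bash
# Print the current ApplicationVersionId of a Managed Service for Apache Flink application.
# Assumption: the AWS CLI is configured with permission to call DescribeApplication.
current_version_id() {
  local app_name="$1"
  aws kinesisanalyticsv2 describe-application \
    --application-name "$app_name" \
    --query 'ApplicationDetail.ApplicationVersionId' \
    --output text
}
```

&lt;p&gt;You could then capture the value with &lt;code&gt;version_id="$(current_version_id MyApplication)"&lt;/code&gt; and pass &lt;code&gt;"$version_id"&lt;/code&gt; to &lt;code&gt;--current-application-version-id&lt;/code&gt;. The version ID changes after every successful update, so look it up again before each later call.&lt;/p&gt; 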
&lt;h3&gt;Step 4: Take a manual snapshot (recommended)&lt;/h3&gt; 
&lt;p&gt;Although the upgrade process takes an automatic snapshot, taking a manual snapshot gives you a named restore point that you can quickly identify.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;aws kinesisanalyticsv2 create-application-snapshot \
    --application-name MyApplication \
    --snapshot-name pre-flink-2.2-upgrade&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Verify that the snapshot is ready before proceeding:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;aws kinesisanalyticsv2 describe-application-snapshot \
    --application-name MyApplication \
    --snapshot-name pre-flink-2.2-upgrade&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Wait until&amp;nbsp;SnapshotStatus&amp;nbsp;is&amp;nbsp;READY.&lt;/p&gt; 
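&lt;p&gt;If you script this step, the wait can be automated with a small polling loop. The following is a minimal sketch, not a definitive implementation: the function name &lt;code&gt;wait_for_snapshot&lt;/code&gt;, the 10-second default interval, and the 30-attempt default limit are assumptions that you can adjust.&lt;/p&gt; 

```shell
#!/usr/bin/env bash
# Poll a snapshot until it reports READY.
# Returns non-zero if the snapshot reports FAILED or the attempt limit is reached.
# Arguments: application name, snapshot name, [seconds between polls], [max polls].
wait_for_snapshot() {
  local app_name="$1" snapshot_name="$2"
  local interval="${3:-10}" max_attempts="${4:-30}"
  local attempt=1 status
  while [ "$attempt" -le "$max_attempts" ]; do
    status="$(aws kinesisanalyticsv2 describe-application-snapshot \
      --application-name "$app_name" \
      --snapshot-name "$snapshot_name" \
      --query 'SnapshotDetails.SnapshotStatus' \
      --output text)"
    case "$status" in
      READY) echo "Snapshot $snapshot_name is READY"; return 0 ;;
      FAILED) echo "Snapshot $snapshot_name FAILED"; return 1 ;;
    esac
    # Any other status (for example CREATING) means keep waiting.
    sleep "$interval"
    attempt=$((attempt + 1))
  done
  echo "Timed out waiting for snapshot $snapshot_name"
  return 1
}
```

&lt;p&gt;For example, &lt;code&gt;wait_for_snapshot MyApplication pre-flink-2.2-upgrade&lt;/code&gt; returns successfully only after the snapshot is ready, so a script can stop before the upgrade if the restore point is missing.&lt;/p&gt; 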
&lt;h3&gt;Step 5: Run the upgrade&lt;/h3&gt; 
&lt;p&gt;Run the upgrade while the application is in&amp;nbsp;RUNNING&amp;nbsp;or&amp;nbsp;READY&amp;nbsp;(stopped) state. The following example upgrades a running application and points to the new JAR.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;AWS CLI:&lt;/strong&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;aws kinesisanalyticsv2 update-application \
    --application-name MyApplication \
    --current-application-version-id &amp;lt;version-id&amp;gt; \
    --runtime-environment-update FLINK-2_2 \
    --application-configuration-update '{
        "ApplicationCodeConfigurationUpdate": {
            "CodeContentUpdate": {
                "S3ContentLocationUpdate": {
                    "FileKeyUpdate": "my-app-flink-2.2.jar"
                }
            }
        }
    }'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;AWS Management Console:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;To upgrade from the console, follow these steps:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Navigate to your application in the Amazon Managed Service for Apache Flink console.&lt;/li&gt; 
 &lt;li&gt;Choose&amp;nbsp;&lt;strong&gt;Configure&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Select the&amp;nbsp;&lt;strong&gt;Flink 2.2&lt;/strong&gt;&amp;nbsp;runtime.&lt;/li&gt; 
 &lt;li&gt;Point to your new application JAR on Amazon S3.&lt;/li&gt; 
 &lt;li&gt;Select the snapshot to restore from (use&amp;nbsp;&lt;strong&gt;Latest&lt;/strong&gt;&amp;nbsp;to start from the most recent snapshot).&lt;/li&gt; 
 &lt;li&gt;Choose&amp;nbsp;&lt;strong&gt;Update&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;strong&gt;AWS CloudFormation:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Update the&amp;nbsp;&lt;code&gt;RuntimeEnvironment&lt;/code&gt;&amp;nbsp;field in your template. AWS CloudFormation now performs an in-place update instead of deleting and recreating the application.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Terraform:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;If you manage your Flink application with Terraform, you can perform the same in-place upgrade by updating the&amp;nbsp;&lt;code&gt;runtime_environment&lt;/code&gt; and code reference in your&amp;nbsp;aws_kinesisanalyticsv2_application&amp;nbsp;resource. Note: Terraform support for&amp;nbsp;FLINK-2_2&amp;nbsp;requires AWS provider version 6.40.0 or later (released April 8, 2026). Earlier provider versions don’t recognize this runtime value. First, update your provider version constraint:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-css"&gt;terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "&amp;gt;= 6.40.0"
    }
  }
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Then run&amp;nbsp;&lt;code&gt;terraform init -upgrade&lt;/code&gt;&amp;nbsp;to pull the new provider. Next, update your application resource. Change&amp;nbsp;&lt;code&gt;runtime_environment&lt;/code&gt;&amp;nbsp;from&amp;nbsp;&lt;code&gt;"FLINK-1_20"&lt;/code&gt;&amp;nbsp;to&amp;nbsp;&lt;code&gt;"FLINK-2_2"&lt;/code&gt;&amp;nbsp;and point to your new JAR:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-typescript"&gt;resource "aws_kinesisanalyticsv2_application" "my_app" {
  name                   = "MyApplication"
  runtime_environment    = "FLINK-2_2"
  service_execution_role = aws_iam_role.flink.arn
  application_configuration {
    application_code_configuration {
      code_content_type = "ZIPFILE"
      code_content {
        s3_content_location {
          bucket_arn = aws_s3_bucket.app_code.arn
          file_key   = "my-app-flink-2.2.jar"
        }
      }
    }
    application_snapshot_configuration {
      snapshots_enabled = true
    }
    flink_application_configuration {
      checkpoint_configuration {
        configuration_type = "DEFAULT"
      }
      monitoring_configuration {
        configuration_type = "CUSTOM"
        log_level          = "INFO"
        metrics_level      = "APPLICATION"
      }
      parallelism_configuration {
        auto_scaling_enabled = true
        configuration_type   = "CUSTOM"
        parallelism          = 4
        parallelism_per_kpu  = 1
      }
    }
  }
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Run the upgrade:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;terraform plan    # Review the in-place update
terraform apply   # Apply the runtime change&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Terraform performs an in-place update of the application, changing the runtime version and code location. The application restarts with the new Flink 2.2 runtime. To roll back with Terraform, revert &lt;code&gt;runtime_environment&lt;/code&gt;&amp;nbsp;to&amp;nbsp;&lt;code&gt;"FLINK-1_20"&lt;/code&gt;, point&amp;nbsp;&lt;code&gt;file_key&lt;/code&gt;&amp;nbsp;back to your original JAR, and run&amp;nbsp;&lt;code&gt;terraform apply&lt;/code&gt;&amp;nbsp;again. Note that you cannot restore a Flink 2.2 snapshot on Flink 1.x, so the rollback starts from the last Flink 1.x snapshot.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Important Terraform considerations:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Auto-rollback and the &lt;code&gt;RollbackApplication&lt;/code&gt; API aren’t directly exposed as Terraform resource attributes. If you need auto-rollback during the upgrade, enable it using the AWS CLI (Step 3) before running&amp;nbsp;terraform apply, or use a provisioner/null_resource to call the CLI.&lt;/li&gt; 
 &lt;li&gt;Always take a manual snapshot (Step 4) before running&amp;nbsp;terraform apply&amp;nbsp;for the upgrade. Terraform doesn’t automatically snapshot before updating the runtime.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Step 6: Monitor the upgrade&lt;/h3&gt; 
&lt;p&gt;After initiating the upgrade, monitor the application to verify that it completes successfully.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Check application status:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;The application should transition through&amp;nbsp;RUNNING&amp;nbsp;→&amp;nbsp;UPDATING&amp;nbsp;→&amp;nbsp;RUNNING. Confirm the runtime version changed to 2.2:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;aws kinesisanalyticsv2 describe-application \
    --application-name MyApplication \
    --query 'ApplicationDetail.RuntimeEnvironment'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;What to watch for:&lt;/strong&gt;&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;thead&gt; 
  &lt;tr&gt; 
   &lt;th style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Scenario&lt;/strong&gt;&lt;/th&gt; 
   &lt;th style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;What happens&lt;/strong&gt;&lt;/th&gt; 
   &lt;th style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Action&lt;/strong&gt;&lt;/th&gt; 
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Binary incompatibility&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Upgrade operation fails. Auto-rollback reverts to the previous version automatically.&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Check operation logs for the exception, fix your code, and retry.&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;State incompatibility&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Upgrade appears to succeed but the application enters restart loops.&lt;/td&gt; 
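   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Monitor the&amp;nbsp;&lt;code&gt;numRestarts&lt;/code&gt;&amp;nbsp;metric. If restarts are continuous, invoke the Rollback API manually. Review the &lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/state-compatibility.html" target="_blank" rel="noopener noreferrer"&gt;State Compatibility Guide&lt;/a&gt;.&lt;/td&gt; 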
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Successful upgrade&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;code&gt;numRestarts&lt;/code&gt;&amp;nbsp;is zero,&amp;nbsp;uptime&amp;nbsp;is increasing, checkpoints are completing.&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Proceed to validation.&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;&lt;strong&gt;Key CloudWatch metrics to monitor:&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;code&gt;numRestarts&lt;/code&gt;: should be zero after upgrade&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;lastCheckpointDuration&lt;/code&gt;: should be similar to pre-upgrade values&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;numberOfFailedCheckpoints&lt;/code&gt;: should remain at zero&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;uptime&lt;/code&gt;: should be steadily increasing&lt;/li&gt; 
&lt;/ol&gt; 
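&lt;p&gt;The four checks above can be combined into a single health check. The following Python sketch assumes you have already fetched the latest metric values into plain dictionaries (for example, from CloudWatch). The function name, the dictionary layout, and the 1.5x tolerance on checkpoint duration are illustrative assumptions, not service defaults.&lt;/p&gt;

```python
def post_upgrade_issues(previous, current, baseline_checkpoint_ms,
                        max_checkpoint_ratio=1.5):
    """Return a list of problems found; an empty list means healthy.

    `previous` and `current` are two consecutive post-upgrade samples, each a
    dict of metric name to value. `baseline_checkpoint_ms` is the typical
    lastCheckpointDuration before the upgrade. The 1.5x tolerance is an
    assumption chosen for illustration.
    """
    issues = []
    if current["numRestarts"] != 0:
        issues.append("numRestarts is not zero")
    if current["numberOfFailedCheckpoints"] != 0:
        issues.append("numberOfFailedCheckpoints is not zero")
    if current["lastCheckpointDuration"] > baseline_checkpoint_ms * max_checkpoint_ratio:
        issues.append("lastCheckpointDuration is well above the baseline")
    # Uptime resets when the job restarts, so it must grow between samples.
    if previous["uptime"] >= current["uptime"]:
        issues.append("uptime is not increasing")
    return issues
```

&lt;p&gt;An empty list corresponds to the "Successful upgrade" row in the preceding table. A non-empty list is a signal to investigate before you proceed to validation.&lt;/p&gt;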
&lt;h3&gt;Step 7: Validate application behavior&lt;/h3&gt; 
&lt;p&gt;After the application is running on Flink 2.2:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Confirm that data is being read from sources and written to sinks.&lt;/li&gt; 
 &lt;li&gt;Compare the output with your pre-upgrade baseline.&lt;/li&gt; 
 &lt;li&gt;Monitor latency, throughput, checkpoint duration, and resource utilization.&lt;/li&gt; 
 &lt;li&gt;Run for at least 24 hours to confirm stable behavior: no memory leaks, no unexpected restarts, consistent checkpoint sizes.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Step 8: Rollback (if needed)&lt;/h3&gt; 
&lt;p&gt;If the application is running but is unhealthy after the upgrade, invoke the Rollback API:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;AWS CLI:&lt;/strong&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;aws kinesisanalyticsv2 rollback-application \
    --application-name MyApplication \
    --current-application-version-id &amp;lt;version-id&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;AWS Management Console:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Navigate to your application.&lt;/li&gt; 
 &lt;li&gt;Choose&amp;nbsp;&lt;strong&gt;Actions&lt;/strong&gt;, &lt;strong&gt;Roll back&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Confirm the rollback.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;During rollback, the application stops, reverts to the previous Flink version and application code, and restarts from the snapshot taken before the upgrade.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt;&amp;nbsp;You can’t restore a Flink 2.2 snapshot on Flink 1.x. Rollback uses the snapshot taken before the upgrade. This is why Steps 3 and 4 are critical.&lt;/p&gt; 
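&lt;p&gt;The CLI call needs the current application version ID, so a scripted rollback typically looks it up first. The following is a minimal Python sketch, assuming a Boto3 &lt;code&gt;kinesisanalyticsv2&lt;/code&gt; client that you create yourself. The helper name &lt;code&gt;rollback_to_previous_version&lt;/code&gt; is hypothetical.&lt;/p&gt;

```python
def rollback_to_previous_version(client, application_name):
    """Look up the current version ID, then invoke the Rollback API.

    `client` is assumed to be a Boto3 "kinesisanalyticsv2" client, for example
    boto3.client("kinesisanalyticsv2").
    """
    detail = client.describe_application(
        ApplicationName=application_name
    )["ApplicationDetail"]
    response = client.rollback_application(
        ApplicationName=application_name,
        # The API rejects the call if this does not match the latest version.
        CurrentApplicationVersionId=detail["ApplicationVersionId"],
    )
    return response["ApplicationDetail"]
```

&lt;p&gt;After the call returns, monitor the application status in the same way as after the upgrade until it reports RUNNING again.&lt;/p&gt;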
&lt;h2&gt;Next steps&lt;/h2&gt; 
&lt;p&gt;Your path depends on where you are today:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;If you’re new to Apache Flink:&lt;/strong&gt;&amp;nbsp;Start with the&amp;nbsp;guide to choosing the right API and language, the&amp;nbsp;&lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/getting-started.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Service for Apache Flink getting started guide&lt;/a&gt;, and the&amp;nbsp;&lt;a href="https://catalog.workshops.aws/managed-flink" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Service for Apache Flink workshop&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;If you’re running Flink 1.x in production:&lt;/strong&gt;&amp;nbsp;Follow the migration steps in this post on a non-production replica first, then apply them to production. For the complete reference, see&amp;nbsp;Upgrading to Flink 2.2: Complete Guide&amp;nbsp;and the&amp;nbsp;State Compatibility Guide for Flink 2.2 Upgrades.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;If you’re evaluating Flink 2.2 features:&lt;/strong&gt;&amp;nbsp;Launch a new application on the Flink 2.2 runtime to explore SQL/ML capabilities, the VARIANT data type, and the new join operators. See the&amp;nbsp;&lt;a href="https://github.com/aws-samples/amazon-managed-service-for-apache-flink-examples" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Service for Apache Flink sample applications on GitHub&lt;/a&gt;&amp;nbsp;for reference architectures.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;If you need help with your migration:&lt;/strong&gt;&amp;nbsp;Use the &lt;a href="https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files" target="_blank" rel="noopener noreferrer"&gt;Kiro Power and Agent Skill for Amazon Managed Service for Apache Flink&lt;/a&gt; to identify compatibility issues in your existing codebase and receive guidance on refactoring steps. You can also open a case through&amp;nbsp;&lt;a href="https://aws.amazon.com/support/" target="_blank" rel="noopener noreferrer"&gt;AWS Support&lt;/a&gt;, post a question on&amp;nbsp;&lt;a href="https://repost.aws/tags/TAjj_AYVQYR-a2FMqOkMcEPg/amazon-managed-service-for-apache-flink" target="_blank" rel="noopener noreferrer"&gt;AWS re:Post for Amazon Managed Service for Apache Flink&lt;/a&gt;, or reach out through the&amp;nbsp;&lt;a href="https://flink.apache.org/community/" target="_blank" rel="noopener noreferrer"&gt;Apache Flink community&lt;/a&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;For the Apache Flink 2.2 documentation, see&amp;nbsp;&lt;a href="https://nightlies.apache.org/flink/flink-docs-release-2.2/" target="_blank" rel="noopener noreferrer"&gt;nightlies.apache.org/flink/flink-docs-release-2.2&lt;/a&gt;. For Amazon Managed Service for Apache Flink documentation, see the&amp;nbsp;&lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/what-is.html" target="_blank" rel="noopener noreferrer"&gt;Developer Guide&lt;/a&gt;. For pricing, see the&amp;nbsp;&lt;a href="https://aws.amazon.com/managed-service-apache-flink/pricing/" target="_blank" rel="noopener noreferrer"&gt;pricing page&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;With Apache Flink 2.2 on Amazon Managed Service for Apache Flink, you get a modern Java 17 runtime, SQL-native AI/ML inference, improved state management performance, and a streamlined API surface. In-place upgrades with state preservation and auto-rollback make the migration straightforward. Test on a replica, follow the steps in this post, and start building on Flink 2.2.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-medium wp-image-90329" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/16/fmorillo-225x300.jpg" alt="" width="225" height="300"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Francisco Morillo&lt;/h3&gt; 
  &lt;p&gt;Francisco Morillo&amp;nbsp;is a Sr. Streaming Specialist Solutions Architect at AWS, helping customers design and operate real-time data processing applications using Amazon Managed Service for Apache Flink and Amazon Managed Streaming for Apache Kafka.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone wp-image-90607 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/23/profilepic.jpg" alt="" width="940" height="1072"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Mayank Juneja&lt;/h3&gt; 
  &lt;p&gt;Mayank Juneja is a Senior Product Manager at AWS, leading Amazon Managed Service for Apache Flink. He lives at the intersection of real-time data streaming and AI, previously driving Flink SQL and AI inference products at Confluent.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Using Apache Sedona with AWS Glue to process billions of daily points from a geospatial dataset</title>
		<link>https://aws.amazon.com/blogs/big-data/using-apache-sedona-with-aws-glue-to-process-billions-of-daily-points-from-a-geospatial-dataset/</link>
					
		
		<dc:creator><![CDATA[Ruan Roloff]]></dc:creator>
		<pubDate>Wed, 22 Apr 2026 15:42:28 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[AWS Glue]]></category>
		<category><![CDATA[Open Source]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">36afbfac5e6c3da1aa34ee703d844eb614c66772</guid>

					<description>In this post, we explore how to use Apache Sedona with AWS Glue to process and analyze massive geospatial datasets.</description>
										<content:encoded>&lt;p&gt;Data strategy can use geospatial data to provide organizations with insights for decision-making and operational optimization. By incorporating geospatial data (such as GPS coordinates, points, polygons and geographic boundaries), businesses can uncover patterns, trends, and relationships that might otherwise remain hidden across multiple industries, from aviation and transportation to environmental studies and urban planning. Processing and analyzing this geospatial data at scale can be challenging, especially when dealing with billions of daily observations.&lt;/p&gt; 
&lt;p&gt;In this post, we explore how to use &lt;a href="https://sedona.apache.org/latest/" target="_blank" rel="noopener noreferrer"&gt;Apache Sedona&lt;/a&gt; with &lt;a href="https://aws.amazon.com/glue/" target="_blank" rel="noopener noreferrer"&gt;AWS Glue&lt;/a&gt; to process and analyze massive geospatial datasets.&lt;/p&gt; 
&lt;h2&gt;Introduction to geospatial data&lt;/h2&gt; 
&lt;p&gt;Geospatial data is information that has a geographic component. It describes objects, events, or phenomena along with their location on the Earth’s surface. This data includes coordinates (latitude and longitude), shapes (points, lines, polygons), and associated attributes (such as the name of a city or the type of road).&lt;/p&gt; 
&lt;p&gt;Key types of geospatial geometries (and examples of each in parentheses) include:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Point –&lt;/strong&gt; Represents a single coordinate (a weather station).&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;MultiPoint –&lt;/strong&gt; A collection of points (bus stops in a city).&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;LineString –&lt;/strong&gt; A series of points connected in a line (a river or a flight path).&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;MultiLineString –&lt;/strong&gt; Multiple lines (multiple flight routes).&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Polygon –&lt;/strong&gt; A closed area (the boundary of a city).&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;MultiPolygon –&lt;/strong&gt; Multiple polygons (national parks in a country).&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Geospatial datasets come in different formats, each designed to store and represent different types of geographic information. Common formats for geospatial data are vector formats (Shapefile, GeoJSON), raster formats (GeoTIFF, ESRI Grid), GPS formats (GPX, NMEA), web formats (WMS, GeoRSS) among others.&lt;/p&gt; 
&lt;h2&gt;Core concepts of Apache Sedona&lt;/h2&gt; 
&lt;p&gt;&lt;a href="https://sedona.apache.org/" target="_blank" rel="noopener noreferrer"&gt;Apache Sedona&lt;/a&gt; is an open-source computing framework for processing large-scale geospatial data. Built on top of &lt;a href="https://spark.apache.org/" target="_blank" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt;, Sedona extends Spark’s capabilities to handle spatial operations efficiently. At its core, Sedona introduces several key concepts that enable distributed spatial processing. These include Spatial Resilient Distributed Datasets (SRDDs), which allow for the distribution of spatial data across a cluster, and Spatial SQL, which provides a familiar SQL-like interface for spatial queries. Some of the core capabilities of Apache Sedona are:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Efficient spatial data types like points, lines and polygons.&lt;/li&gt; 
 &lt;li&gt;Spatial operations and functions such as &lt;code&gt;ST_Contains&lt;/code&gt; (checks whether one geometry, such as a polygon, fully contains another, such as a point), &lt;code&gt;ST_Intersects&lt;/code&gt; (checks whether two geometries share at least one point), and &lt;code&gt;ST_H3CellIDs&lt;/code&gt; (returns the IDs of the cells that cover a geometry at the specified resolution in &lt;a href="https://h3geo.org/" target="_blank" rel="noopener noreferrer"&gt;H3&lt;/a&gt;, the hexagonal geospatial indexing system developed by Uber).&lt;/li&gt; 
 &lt;li&gt;Spatial joins to combine different spatial datasets.&lt;/li&gt; 
 &lt;li&gt;Integration with Spark SQL (geospatial functions to run spatial SQL queries).&lt;/li&gt; 
 &lt;li&gt;Spatial indexing techniques, such as quad-trees and R-trees, to optimize query performance.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;For more information about the functions available in Apache Sedona, visit the official Sedona &lt;a href="https://sedona.apache.org/1.8.0/api/sql/Function/" target="_blank" rel="noopener noreferrer"&gt;Functions&lt;/a&gt; documentation.&lt;/p&gt; 
&lt;h2&gt;Use case&lt;/h2&gt; 
&lt;p&gt;This use case consists of a global air traffic visualization and analysis platform that processes and displays real-time or historical aircraft tracking data on an interactive world map. Using unique aircraft identifiers from the International Civil Aviation Organization (ICAO), the system ingests trajectory records containing information such as geographic position (latitude and longitude), altitude, speed, and flight direction, then transforms this raw data into two complementary visual layers. The Flight Tracks Layer plots the routes traveled by each aircraft individually, allowing for the analysis of specific trajectories and navigation patterns. The Flight Density Layer uses hexagonal spatial indexing (H3) to aggregate and identify regions of higher air traffic concentration worldwide, revealing busy air corridors, aviation hubs, and high-density flight zones.&lt;/p&gt; 
&lt;p&gt;The dataset used for this use case is &lt;a href="https://www.adsb.lol/docs/open-data/historical/" target="_blank" rel="noopener noreferrer"&gt;historical flight tracker data&lt;/a&gt; from &lt;a href="https://www.adsb.lol/" target="_blank" rel="noopener noreferrer"&gt;ADSB.lol&lt;/a&gt;. ADSB.lol provides an unfiltered flight tracker with a focus on open data, and the data is also freely available through its API. The historical dataset contains one gzip-compressed JSON file per aircraft, holding that aircraft’s data for the day.&lt;/p&gt; 
&lt;p&gt;This is a JSON trace file format sample:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-typescript"&gt;{
    icao: "0123ac", // hex id of the aircraft
    timestamp: 1609275898.495, // unix timestamp in seconds since epoch (1970)
    trace: [
        [ seconds after timestamp,
            lat,
            lon,
            altitude in ft or "ground" or null,
            ground speed in knots or null,
            track in degrees or null, (if altitude == "ground", this will be true heading instead of track)
            flags as a bitfield: (use bitwise and to extract data)
                (flags &amp;amp; 1 &amp;gt; 0): position is stale (no position received for 20 seconds before this one)
                (flags &amp;amp; 2 &amp;gt; 0): start of a new leg (tries to detect a separation point between landing and takeoff that separates flights)
                (flags &amp;amp; 4 &amp;gt; 0): vertical rate is geometric and not barometric
                (flags &amp;amp; 8 &amp;gt; 0): altitude is geometric and not barometric
             ,
            vertical rate in fpm or null,
            aircraft object with extra details or null,
            type / source of this position or null,
            geometric altitude or null,
            geometric vertical rate or null,
            indicated airspeed or null,
            roll angle or null
        ],
    ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
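&lt;p&gt;Because each &lt;code&gt;trace&lt;/code&gt; entry is a positional array, it helps to name the fields and decode the flags bitfield before further processing. The following Python sketch does this for the first seven positions described above. The class, function, and key names are our own illustrative choices and are not part of the ADSB.lol format or the AWS Glue script.&lt;/p&gt;

```python
from enum import IntFlag


class TraceFlag(IntFlag):
    """Bit values of the flags field, as listed in the format sample."""
    STALE_POSITION = 1
    NEW_LEG = 2
    GEOMETRIC_VERTICAL_RATE = 4
    GEOMETRIC_ALTITUDE = 8


def decode_trace_point(base_timestamp, point):
    """Turn one positional trace entry into a dict with named fields."""
    offset, lat, lon, altitude, ground_speed, track, flags = point[:7]
    flag_set = TraceFlag(flags)
    return {
        # Each entry stores seconds elapsed since the file-level timestamp.
        "timestamp": base_timestamp + offset,
        "lat": lat,
        "lon": lon,
        "on_ground": altitude == "ground",
        "altitude_ft": None if altitude in ("ground", None) else altitude,
        "ground_speed_kt": ground_speed,
        "track_deg": track,
        # Membership tests on an IntFlag are equivalent to a bitwise AND.
        "stale": TraceFlag.STALE_POSITION in flag_set,
        "new_leg": TraceFlag.NEW_LEG in flag_set,
    }
```

&lt;p&gt;For example, a flags value of 3 sets both &lt;code&gt;stale&lt;/code&gt; and &lt;code&gt;new_leg&lt;/code&gt; to &lt;code&gt;True&lt;/code&gt;, because bits 1 and 2 are both set.&lt;/p&gt;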
&lt;p&gt;For this use case, this is a simplified schema of the dataset after processing:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;code&gt;icao -&lt;/code&gt; Unique aircraft identifier&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;timestamp -&lt;/code&gt; Epoch timestamp of the observation (converted to readable format)&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;trace.lat / trace.lon -&lt;/code&gt; Latitude and longitude of the aircraft&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;trace.altitude -&lt;/code&gt; Aircraft altitude&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;trace.ground_speed -&lt;/code&gt; Ground speed&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;geometry -&lt;/code&gt; Geospatial geometry of the observation point (&lt;code&gt;Point&lt;/code&gt;)&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;This solution enables aircraft tracking and analysis. The data can be visualized on maps and used for aviation management and safety applications. The process begins with data acquisition, extracting the compressed JSON files from TAR archives, then transforms this raw data into geospatial objects, aggregating them into H3 cells for efficient analysis. The processed data schema includes ICAO aircraft identifiers, timestamps, latitude/longitude coordinates, and derived fields such as H3 cell identifiers and point counts per cell. This structure allows detailed tracking of individual flights and aggregate analysis of traffic patterns. For visualization, you can generate density maps using the H3 grid system and create visual representations of individual flight tracks. The architecture data flow is as follows:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Data ingestion –&lt;/strong&gt; Aircraft observation data stored as JSON compressed files in &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener noreferrer"&gt;Amazon Simple Storage Service&lt;/a&gt; (Amazon S3).&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Data processing –&lt;/strong&gt; AWS Glue jobs using Apache Sedona for geospatial processing.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Data visualization –&lt;/strong&gt; Spark SQL with Sedona’s spatial functions to extract insights and export data to visualize the information in a map on Kepler.gl.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The following figure illustrates this solution.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90098" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/13/BDB-5249-geospatial-1_v2.png" alt="AWS architecture diagram showing a geospatial data processing pipeline." width="761" height="728"&gt;&lt;/p&gt; 
&lt;h3&gt;Prerequisites&lt;/h3&gt; 
&lt;p&gt;You will need the following for this solution:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;An &lt;a href="https://aws.amazon.com/resources/create-account/" target="_blank" rel="noopener noreferrer"&gt;AWS Account&lt;/a&gt; and a user with AWS Console access.&lt;/li&gt; 
 &lt;li&gt;Access to a Linux terminal and the &lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-quickstart.html" target="_blank" rel="noopener noreferrer"&gt;AWS Command Line Interface&lt;/a&gt; (AWS CLI).&lt;/li&gt; 
 &lt;li&gt;An &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/create-an-iam-role.html" target="_blank" rel="noopener noreferrer"&gt;IAM role for AWS Glue&lt;/a&gt; with list, read, and write permissions for Amazon S3 buckets.&lt;/li&gt; 
 &lt;li&gt;An Amazon S3 Bucket for flight files. For this example, name the bucket &lt;code&gt;blog-sedona-nessie-&amp;lt;account_number&amp;gt;-&amp;lt;aws_region&amp;gt;&lt;/code&gt;, using your account number and region.&lt;/li&gt; 
 &lt;li&gt;An Amazon S3 bucket for artifacts and Sedona libraries. For this example, name the bucket &lt;code&gt;blog-sedona-artifacts-&amp;lt;account_number&amp;gt;-&amp;lt;aws_region&amp;gt;&lt;/code&gt;, using your account number and region.&lt;/li&gt; 
 &lt;li&gt;Download a day of historical data from &lt;a href="https://www.adsb.lol/docs/open-data/historical/" target="_blank" rel="noopener noreferrer"&gt;ADSB.lol&lt;/a&gt;. In our examples, we used &lt;a href="https://github.com/adsblol/globe_history_2025/releases/download/v2025.05.29-planes-readsb-prod-0tmp/v2025.05.29-planes-readsb-prod-0tmp.tar.aa" target="_blank" rel="noopener noreferrer"&gt;v2025.05.29-planes-readsb-prod-0tmp.tar.aa&lt;/a&gt; and &lt;a href="https://github.com/adsblol/globe_history_2025/releases/download/v2025.05.29-planes-readsb-prod-0tmp/v2025.05.29-planes-readsb-prod-0tmp.tar.ab" target="_blank" rel="noopener noreferrer"&gt;v2025.05.29-planes-readsb-prod-0tmp.tar.ab&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;Download the Apache Sedona libraries. The example was created using &lt;a href="https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-shaded-3.5_2.12/1.7.1/sedona-spark-shaded-3.5_2.12-1.7.1.jar" target="_blank" rel="noopener noreferrer"&gt;sedona-spark-shaded-3.5_2.12-1.7.1.jar&lt;/a&gt; and &lt;a href="https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/1.7.1-28.5/geotools-wrapper-1.7.1-28.5.jar" target="_blank" rel="noopener noreferrer"&gt;geotools-wrapper-1.7.1-28.5.jar&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;Download the &lt;a href="https://github.com/aws-samples/sample-blog-geospacial-lake-on-aws-with-aws-dataservices/blob/main/src/glue_scripts/process_sedona_geo_track.py" target="_blank" rel="noopener noreferrer"&gt;AWS Glue script&lt;/a&gt; from the AWS Samples GitHub repository to process the geospatial data.&lt;/li&gt; 
 &lt;li&gt;Review the &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/security.html" target="_blank" rel="noopener noreferrer"&gt;AWS Glue security best practices&lt;/a&gt;, especially IAM least-privilege, encryption for sensitive data at rest and in transit, and configuring VPC Endpoints to prevent data from routing through the public internet.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Solution walkthrough&lt;/h2&gt; 
&lt;p&gt;Note that running the following steps incurs costs on AWS. This step-by-step walkthrough demonstrates an approach to processing and analyzing large-scale geospatial flight data, using AWS Glue for distributed processing, Apache Sedona for efficient geospatial computations, and Uber’s H3 spatial indexing system. It explains how to ingest raw flight data, transform it using Sedona’s geospatial functions, and index it with H3 for optimized spatial queries. Finally, it demonstrates how to visualize the data using Kepler.gl. For data processing, you can use either Glue scripts or &lt;a href="https://sedona.apache.org/latest/setup/glue/" target="_blank" rel="noopener noreferrer"&gt;Glue notebooks&lt;/a&gt;. In this post, we focus only on Glue scripts.&lt;/p&gt; 
&lt;h3&gt;Upload the Apache Sedona libraries to Amazon S3&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Open your OS terminal command line.&lt;/li&gt; 
 &lt;li&gt;Create a folder to download the Sedona libraries and name it &lt;strong&gt;jar&lt;/strong&gt;. &lt;pre&gt;&lt;code class="lang-bash"&gt;
	# Create a directory for the Sedona libraries (JAR files)
	mkdir jar
	# Go to the JAR folder
	cd jar
	&lt;/code&gt;&lt;/pre&gt; &lt;/li&gt; 
 &lt;li&gt;Download the Apache Sedona libraries. &lt;pre&gt;&lt;code class="lang-bash"&gt;
	# Download the required Sedona libraries (JAR files)
	wget https://repo1.maven.org/maven2/org/apache/sedona/sedona-spark-shaded-3.5_2.12/1.7.1/sedona-spark-shaded-3.5_2.12-1.7.1.jar
	wget https://repo1.maven.org/maven2/org/datasyslab/geotools-wrapper/1.7.1-28.5/geotools-wrapper-1.7.1-28.5.jar
	&lt;/code&gt;&lt;/pre&gt; &lt;/li&gt; 
 &lt;li&gt;Upload the Sedona libraries (JAR files) to Amazon S3. In this example, we use the S3 path &lt;code&gt;s3://blog-sedona-artifacts-&amp;lt;account_number&amp;gt;-&amp;lt;aws_region&amp;gt;/jar/&lt;/code&gt;. &lt;pre&gt;&lt;code class="lang-bash"&gt;
	# Upload the JAR files to the Amazon S3 bucket
	aws s3 cp . s3://blog-sedona-artifacts-&amp;lt;account_number&amp;gt;-&amp;lt;aws_region&amp;gt;/jar/ --recursive
	&lt;/code&gt;&lt;/pre&gt; &lt;/li&gt; 
 &lt;li&gt;Your Amazon S3 folder should now look similar to the following image:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90099" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/10/BDB-5249-geospatial-2.jpg" alt="Amazon S3 console screenshot displaying the jar folder contents in blog-sedona-artifacts bucket." width="2560" height="919"&gt;&lt;/p&gt; 
&lt;h3&gt;Download and upload the geospatial data to Amazon S3&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Open your OS terminal command line.&lt;/li&gt; 
 &lt;li&gt;Create a folder to download the flight files and name it &lt;strong&gt;adsb_dataset&lt;/strong&gt;. &lt;pre&gt;&lt;code class="lang-bash"&gt;		# Create a directory for the geospatial flight files
		mkdir adsb_dataset
		# Go to the folder for geospatial flight files
		cd adsb_dataset
	&lt;/code&gt;&lt;/pre&gt; &lt;/li&gt; 
 &lt;li&gt;Download the flight data files from the &lt;a href="https://github.com/adsblol/globe_history_2025/releases" target="_blank" rel="noopener noreferrer"&gt;adsblol GitHub repository&lt;/a&gt;. &lt;pre&gt;&lt;code class="lang-bash"&gt;	# Download the geospatial flight files into the folder you created
	wget https://github.com/adsblol/globe_history_2025/releases/download/v2025.05.29-planes-readsb-prod-0tmp/v2025.05.29-planes-readsb-prod-0tmp.tar.aa
	wget https://github.com/adsblol/globe_history_2025/releases/download/v2025.05.29-planes-readsb-prod-0tmp/v2025.05.29-planes-readsb-prod-0tmp.tar.ab
	&lt;/code&gt;&lt;/pre&gt; &lt;/li&gt; 
 &lt;li&gt;Extract the flight files. &lt;pre&gt;&lt;code class="lang-bash"&gt;	# Combine the two tar parts into a single archive (overwrites any earlier combined.tar)
	cat v2025.05.29* &amp;gt; combined.tar
	# Extract the JSON flight files from the tar file
	tar xf combined.tar
	&lt;/code&gt;&lt;/pre&gt; &lt;/li&gt; 
 &lt;li&gt;Copy the flight files to Amazon S3. In this case, we are using the S3 folder: &lt;code&gt;s3://blog-sedona-nessie-&amp;lt;account_number&amp;gt;-&amp;lt;aws_region&amp;gt;/raw/adsb-2025-05-28/traces/&lt;/code&gt;. &lt;pre&gt;&lt;code class="lang-bash"&gt;	# Copy the json flight files to Amazon S3
	aws s3 cp ./traces/ s3://blog-sedona-nessie-&amp;lt;account_number&amp;gt;-&amp;lt;aws_region&amp;gt;/raw/adsb-2025-05-28/traces/ --recursive
	&lt;/code&gt;&lt;/pre&gt; &lt;/li&gt; 
 &lt;li&gt;Your Amazon S3 folder should now look similar to the following image.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90100" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/10/BDB-5249-geospatial-3-scaled.jpg" alt="Amazon S3 console showing JSON trace files in the path raw/adsb-2025-05-28/traces/00/." width="2560" height="1096"&gt;&lt;/p&gt; 
&lt;h3&gt;Create an AWS Glue job and set up the job&lt;/h3&gt; 
&lt;p&gt;Now, we are ready to define the AWS Glue job using Apache Sedona to read the geospatial data files. To create a Glue job:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Open the &lt;a href="https://console.aws.amazon.com/glue/" target="_blank" rel="noopener noreferrer"&gt;AWS Glue console&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;On the &lt;strong&gt;Notebooks&lt;/strong&gt; page, choose &lt;strong&gt;Script editor&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90101" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/10/BDB-5249-geospatial-4-scaled.jpg" alt="AWS Glue Studio jobs creation interface showing three job creation methods: Visual ETL with data flow interface, Notebook for interactive coding, and Script editor for code authoring" width="2560" height="800"&gt;&lt;/p&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;On the Script screen, for the engine, choose &lt;strong&gt;Spark&lt;/strong&gt;, then select the option &lt;strong&gt;Upload script&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Choose file&lt;/strong&gt;. Find the &lt;code&gt;process_sedona_geo_track.py&lt;/code&gt; file, then choose &lt;strong&gt;Create script&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90102" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/10/BDB-5249-geospatial-5.jpg" alt="Script creation dialog box with Spark engine selected. Upload script option is active, showing successfully uploaded file process_sedona_geo_track.py." width="1602" height="730"&gt;&lt;/p&gt; 
&lt;ol start="5"&gt; 
 &lt;li&gt;Rename the job from &lt;strong&gt;Untitled&lt;/strong&gt; to &lt;strong&gt;process_sedona_geo_track&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Save&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Now, let’s set up the AWS Glue job. Choose &lt;strong&gt;Job Details.&lt;/strong&gt;&lt;/li&gt; 
 &lt;li&gt;Choose the &lt;strong&gt;IAM Role&lt;/strong&gt; created to be used with Glue. For this example, we use &lt;strong&gt;blog-glue&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Set the &lt;strong&gt;Glue version&lt;/strong&gt; to &lt;strong&gt;Glue 5.0&lt;/strong&gt; and the Worker type as needed. For this example, &lt;strong&gt;G.1X&lt;/strong&gt; is sufficient, but we use &lt;strong&gt;G.2X&lt;/strong&gt; to speed up processing.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90103" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/10/BDB-5249-geospatial-6.jpg" alt="AWS Glue job details configuration page for process_sedona_geo_track." width="2182" height="1050"&gt;&lt;/p&gt; 
&lt;ol start="10"&gt; 
 &lt;li&gt;Now, let’s import the libraries for Apache Sedona.&lt;/li&gt; 
 &lt;li&gt;In the &lt;strong&gt;Dependent JARs path&lt;/strong&gt;, type the path of the JAR files for Apache Sedona that you uploaded in the preceding steps. For this example, we used &lt;code&gt;s3://blog-sedona-artifacts-&amp;lt;account_number&amp;gt;-&amp;lt;aws_region&amp;gt;/jar/sedona-spark-shaded-3.5_2.12-1.7.1.jar,s3://blog-sedona-artifacts-&amp;lt;account_number&amp;gt;-&amp;lt;aws_region&amp;gt;/jar/geotools-wrapper-1.7.1-28.5.jar&lt;/code&gt;&lt;/li&gt; 
 &lt;li&gt;In &lt;strong&gt;Additional Python modules path&lt;/strong&gt;, enter the modules for Apache Sedona: &lt;strong&gt;apache-sedona==1.7.1,geopandas==0.13.2,shapely==2.0.1,pyproj==3.6.0,fiona==1.9.5,rtree==1.2.0&lt;/strong&gt;&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90104" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/10/BDB-5249-geospatial-7.jpg" alt="Job libraries configuration section showing Dependent JARs path pointing to S3 bucket." width="2016" height="956"&gt;&lt;/p&gt; 
&lt;ol start="13"&gt; 
 &lt;li&gt;In the &lt;strong&gt;Job parameters&lt;/strong&gt; section, in the &lt;strong&gt;Key&lt;/strong&gt; field, type &lt;strong&gt;--BUCKET_NAME&lt;/strong&gt;. For its &lt;strong&gt;Value&lt;/strong&gt;, enter your bucket name. In this example, ours is &lt;code&gt;blog-sedona-nessie-&amp;lt;account_number&amp;gt;-&amp;lt;aws_region&amp;gt;&lt;/code&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90105" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/10/BDB-5249-geospatial-8.jpg" alt="Job parameters configuration interface showing key-value pair with --BUCKET_NAME parameter." width="704" height="229"&gt;&lt;/p&gt; 
&lt;ol start="14"&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Save&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h3&gt;Processing the geospatial flights data&lt;/h3&gt; 
&lt;p&gt;Before we run the job, let’s understand how the code works. First, import the Apache Sedona libraries:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
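 &lt;pre&gt;&lt;code class="lang-python"&gt;import json 
import gzip 
from pyspark.sql import Row  # used by flatten_records
from pyspark.sql.functions import col  # used when preparing the visualization datasets
from sedona.spark import SedonaContext&lt;/code&gt;&lt;/pre&gt; 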
&lt;/div&gt; 
&lt;p&gt;Next, initialize the Sedona context using an existing Spark session:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;sedona = SedonaContext.create(spark)&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;After that, create a function for handling compressed JSON data:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;def parse_gzip_json(byte_content):
    try:
        # Decompress the raw bytes, then parse the UTF-8 JSON payload
        decompressed = gzip.decompress(byte_content)
        return json.loads(decompressed.decode('utf-8'))
    except Exception as e:
        # Skip unreadable files; the caller filters out None results
        print(f"Error during gzip parse: {str(e)}")
        return None&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Add a function that flattens the raw tracking data into structured rows, keeping only the points with valid coordinates:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;def flatten_records(json_obj):
    records = []
    if "trace" in json_obj and isinstance(json_obj["trace"], list):
        for point in json_obj["trace"]:
            if len(point) &amp;gt;= 3:
                lat, lon = float(point[1]), float(point[2])
                if -90 &amp;lt;= lat &amp;lt;= 90 and -180 &amp;lt;= lon &amp;lt;= 180:
                    records.append(Row(
                        icao=json_obj.get("icao", None),
                        timestamp=json_obj.get("timestamp", None),
                        lat=lat,
                        lon=lon
                    ))
    return records&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The &lt;code&gt;flat_rdd&lt;/code&gt; variable holds the result of applying these functions to the original gzipped JSON files. Each element in this RDD is a Row object representing a single data point from an aircraft’s trace, with fields for ICAO, timestamp, latitude, and longitude.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;flat_rdd = raw_rdd.map(lambda x: parse_gzip_json(x[1])).filter(lambda x: x is not None).flatMap(flatten_records)&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
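&lt;p&gt;To see what the two functions do without starting a Spark session, the following is a minimal local sketch of the same parse, filter, and flatten steps. It assumes each input element is a (path, bytes) pair, as the &lt;code&gt;x[1]&lt;/code&gt; access in the job suggests. The sample record, the file paths, and the &lt;code&gt;namedtuple&lt;/code&gt; standing in for &lt;code&gt;pyspark.sql.Row&lt;/code&gt; are hypothetical and exist only for illustration.&lt;/p&gt;

```python
import gzip
import json
from collections import namedtuple

# Stand-in for pyspark.sql.Row so the sketch runs without Spark (assumption).
Row = namedtuple("Row", ["icao", "timestamp", "lat", "lon"])


def parse_gzip_json(byte_content):
    """Decompress gzipped bytes and parse them as JSON; return None on failure."""
    try:
        return json.loads(gzip.decompress(byte_content).decode("utf-8"))
    except Exception as e:
        print(f"Error during gzip parse: {e}")
        return None


def flatten_records(json_obj):
    """Emit one Row per trace point that has valid coordinates."""
    records = []
    trace = json_obj.get("trace")
    if not isinstance(trace, list):
        return records
    for point in trace:
        if len(point) >= 3:
            lat, lon = float(point[1]), float(point[2])
            # Same range check as the job: latitude within 90, longitude within 180
            if 90 >= abs(lat) and 180 >= abs(lon):
                records.append(
                    Row(json_obj.get("icao"), json_obj.get("timestamp"), lat, lon)
                )
    return records


# Hypothetical input: one valid file and one that is not gzip at all.
sample = {
    "icao": "abc123",
    "timestamp": 1700000000.0,
    "trace": [[0.0, 51.47, -0.45], [5.0, 95.0, 10.0], [9.0, 51.5]],
}
raw_files = [
    ("s3://example-bucket/trace_abc123.json", gzip.compress(json.dumps(sample).encode("utf-8"))),
    ("s3://example-bucket/broken.json", b"not gzip"),
]

# Plain-Python equivalent of map -> filter -> flatMap
parsed = [parse_gzip_json(content) for _, content in raw_files]
flat_records = [row for obj in parsed if obj is not None for row in flatten_records(obj)]
print(flat_records)
```

&lt;p&gt;In this sketch, the out-of-range point and the point with fewer than three elements are dropped, and the unreadable file is skipped instead of failing the whole run, which mirrors the behavior of the job code.&lt;/p&gt;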
&lt;p&gt;The ADSB trace files contain a deeply nested JSON structure in which the &lt;code&gt;trace&lt;/code&gt; field holds an array of mixed-type arrays, compressed in Gzip format. For this specific case, a Python UDF was one of the most practical and efficient solutions. Because Gzip is a non-splittable format, Spark cannot parallelize work within a file, so any approach is limited to a single worker per file. Without the UDF, the data would also be processed multiple times, across JVM decompression, full JSON parsing, and subsequent re-parsing operations. The UDF avoids this repeated work by reading raw bytes and doing everything in a single Python pass (decompress → parse → extract → validate), returning only the small set of needed fields directly to Spark.&lt;/p&gt; 
&lt;p&gt;The Spark SQL query processes geographic trace data using the H3 hexagonal grid system, converting point data into a regularized hexagonal grid that can help identify areas of high point density. A &lt;a href="https://h3geo.org/docs/core-library/restable/#average-area-in-km2" target="_blank" rel="noopener noreferrer"&gt;resolution&lt;/a&gt; of 5 was adopted, producing hexagons of approximately 253 km² (roughly the same size as the city of Edinburgh, Scotland, which is approximately 264 km²), for its ability to effectively capture route density patterns at the city and metropolitan level.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;h3_traces_df = spark.sql("""
WITH base_h3 AS (
    SELECT
        ST_H3CellIDs(geometry, 5, false)[0] AS h3_index,
        lat,
        lon
    FROM traces
)
SELECT
    COUNT(*) AS num, -- Count points in each H3 cell
    h3_index,
    AVG(lon) AS center_lon,
    AVG(lat) AS center_lat
FROM base_h3
GROUP BY h3_index
""")
&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
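&lt;p&gt;The aggregation half of the query can be illustrated in plain Python. The following sketch reproduces only the &lt;code&gt;GROUP BY&lt;/code&gt; step (count, average longitude, average latitude per cell). It does not compute H3 indexes: the cell IDs and coordinates are made-up placeholders that stand in for the output of the &lt;code&gt;base_h3&lt;/code&gt; CTE.&lt;/p&gt;

```python
from collections import defaultdict

# Hypothetical rows shaped like the base_h3 CTE output: (h3_index, lat, lon).
# The cell IDs are placeholders, not real H3 indexes.
base_h3 = [
    ("cell_a", 40.0, -74.0),
    ("cell_a", 41.0, -73.0),
    ("cell_b", 51.5, -0.5),
]

# Running totals per cell: [point count, sum of lon, sum of lat]
totals = defaultdict(lambda: [0, 0.0, 0.0])
for h3_index, lat, lon in base_h3:
    entry = totals[h3_index]
    entry[0] += 1
    entry[1] += lon
    entry[2] += lat

# Equivalent of COUNT(*), AVG(lon), AVG(lat) ... GROUP BY h3_index
h3_traces = {
    cell: {"num": n, "center_lon": sum_lon / n, "center_lat": sum_lat / n}
    for cell, (n, sum_lon, sum_lat) in totals.items()
}
print(h3_traces)
```

&lt;p&gt;Note that &lt;code&gt;center_lon&lt;/code&gt; and &lt;code&gt;center_lat&lt;/code&gt; are the mean position of the points that fall in a cell, not the geometric center of the hexagon.&lt;/p&gt;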
&lt;p&gt;Finally, this code prepares the datasets for visualization purposes. The first dataset is based on the aircraft unique identifier. The complete dataset for a single day can contain more than 80 million data points. A random sampling rate of 0.1% was applied, which proves sufficient to illustrate route density patterns without overwhelming the Kepler.gl browser renderer. The second dataset aggregates trace points into hexagonal spatial cells (result from the query above).&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;points_viz_sampled = df_points.select(
    col("icao"), # Aircraft unique identifier (24-bit address)
    col("timestamp").cast("double").alias("timestamp"),
    col("lat").cast("double").alias("lat"),
    col("lon").cast("double").alias("lon")
).sample(False, 0.001)

h3_viz_csv = h3_traces_df.select(
    col("num").alias("point_count"),
    col("h3_index").cast("string").alias("h3_index"),
    col("center_lon"),
    col("center_lat")
)&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
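&lt;p&gt;As a rough sizing check, the following sketch estimates how many rows a 0.1% sample returns for a day with 80 million points. It assumes each row is kept independently with probability equal to the fraction (Bernoulli sampling), so the sample size varies slightly from run to run. The helper name is hypothetical.&lt;/p&gt;

```python
import math


def expected_sample_size(total_points, fraction):
    """Return (mean, standard deviation) of the row count of a Bernoulli sample."""
    mean = total_points * fraction
    std_dev = math.sqrt(total_points * fraction * (1.0 - fraction))
    return mean, std_dev


# 80 million points per day, sampled at 0.1% as in the job
mean, std_dev = expected_sample_size(80_000_000, 0.001)
print(f"expected rows: {mean:,.0f} (+/- {std_dev:,.0f})")
```

&lt;p&gt;A sample of roughly 80,000 points is small enough to load comfortably in a browser-based tool, which is the reason for sampling before export.&lt;/p&gt;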
&lt;p&gt;Now that we understand the code, let’s run it.&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Open the &lt;a href="https://console.aws.amazon.com/glue/" target="_blank" rel="noopener noreferrer"&gt;AWS Glue console&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;On the &lt;strong&gt;ETL jobs &amp;gt;&amp;gt; Notebooks &lt;/strong&gt;page, choose the job name &lt;strong&gt;process_sedona_geo_track&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Run&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90106" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/10/BDB-5249-geospatial-9.jpg" alt="Python script editor showing import statements for process_sedona_geo_track job." width="1038" height="417"&gt;&lt;/p&gt; 
&lt;ol start="4"&gt; 
 &lt;li&gt;Monitor the job by choosing the &lt;strong&gt;Runs&lt;/strong&gt; tab.&lt;/li&gt; 
 &lt;li&gt;It may take a few minutes to run the entire job. In our test, it took a little over 8 minutes to process approximately 2.50 GB (67,540 compressed files) with 20 DPUs. After the job finishes, you should see it with the status &lt;strong&gt;Succeeded&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90107" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/10/BDB-5249-geospatial-10.jpg" alt="Job runs monitoring dashboard showing successful execution on June 5, 2025, running from 12:28:03 to 12:36:37 with 8 minutes 19 seconds duration." width="1253" height="785"&gt;&lt;/p&gt; 
&lt;p&gt;Now your data should be saved for a preview visualization demo in a folder named &lt;code&gt;s3://blog-sedona-nessie-&amp;lt;account_number&amp;gt;-&amp;lt;aws_region&amp;gt;/visualization/&lt;/code&gt;.&lt;/p&gt; 
&lt;h3&gt;Performance insights&lt;/h3&gt; 
&lt;p&gt;The workload characterization of this job reveals a CPU-intensive profile, primarily because of the processing of small binary files with GZIP compression and subsequent JSON parsing. Because this pipeline includes Python UDF serialization and partially single-partition write stages, adding DPUs does not yield proportional performance gains. The following table compares AWS Glue configurations and shows the trade-off between computational capacity, execution duration, and associated costs:&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Duration&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Capacity (DPUs)&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Worker type&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Glue version&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Estimated Cost*&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;10 m 7 s&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;32 DPUs&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;G.1X&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;5&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;$2.34&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;11 m 50 s&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;10 DPUs&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;G.1X&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;5&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;$0.88&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;19 m 7 s&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;4 DPUs&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;G.1X&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;5&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;$0.59&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;8 m 19 s&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;20 DPUs&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;G.2X&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;5&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;$1.32&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;*Estimated Cost = DPUs x Duration (hours) x $0.44 per DPU-hour (&lt;code&gt;us-east-1&lt;/code&gt;)&lt;/p&gt; 
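&lt;p&gt;The formula above can be applied to your own runs with a few lines of Python. The following is a minimal sketch with a hypothetical helper name. It uses the exact duration in seconds, so its results can differ by a few cents from the figures in the table, and it assumes the &lt;code&gt;us-east-1&lt;/code&gt; rate of $0.44 per DPU-hour quoted above.&lt;/p&gt;

```python
def estimate_glue_cost(dpus, minutes, seconds, rate_per_dpu_hour=0.44):
    """Estimated cost = DPUs x duration in hours x price per DPU-hour."""
    hours = (minutes * 60 + seconds) / 3600
    return dpus * hours * rate_per_dpu_hour


# (label, DPUs, minutes, seconds) taken from the table above
runs = [
    ("32 DPUs, G.1X", 32, 10, 7),
    ("10 DPUs, G.1X", 10, 11, 50),
    ("4 DPUs, G.1X", 4, 19, 7),
    ("20 DPUs, G.2X", 20, 8, 19),
]

costs = {label: estimate_glue_cost(d, m, s) for label, d, m, s in runs}
for label, cost in costs.items():
    print(f"{label}: ${cost:.2f}")
```

&lt;p&gt;Running the numbers this way makes the trade-off visible: the 4 DPU run is the cheapest but the slowest, while the 32 DPU run costs the most without being the fastest.&lt;/p&gt;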
&lt;h2&gt;Visualizing and analyzing geospatial data with Kepler.gl&lt;/h2&gt; 
&lt;p&gt;&lt;a href="https://kepler.gl/" target="_blank" rel="noopener noreferrer"&gt;Kepler.gl&lt;/a&gt; is an open-source geospatial analysis tool developed by &lt;a href="https://www.uber.com/en-HK/blog/keplergl/" target="_blank" rel="noopener noreferrer"&gt;Uber&lt;/a&gt; with code available at &lt;a href="https://github.com/keplergl/kepler.gl" target="_blank" rel="noopener noreferrer"&gt;Github&lt;/a&gt;. Kepler.gl is designed for large-scale data exploration and visualization, offering multiple map layers, including point, arc, heatmap, and 3D hexagon. It supports various file formats like CSV, GeoJSON, and KML. In this use case, we will use Kepler.gl to present interactive visualizations that illustrate flight patterns, routes, and densities across global airspace.&lt;/p&gt; 
&lt;h3&gt;Downloading the geospatial files&lt;/h3&gt; 
&lt;p&gt;Before we can view the graph, we will need to download the flight files to our local machine, unzip them, and rename them (to make it easier to identify the files).&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Open your OS terminal command line.&lt;/li&gt; 
 &lt;li&gt;Create the folders to download the data processed in the steps before. In this case, we create &lt;strong&gt;kepler&lt;/strong&gt; and &lt;strong&gt;kepler_csv&lt;/strong&gt;. &lt;pre&gt;&lt;code class="lang-bash"&gt;	#create kepler folders: first folder is to download the files,
	#second folder is to organize the files to use in the next step
	mkdir kepler
	mkdir kepler_csv
	&lt;/code&gt;&lt;/pre&gt; &lt;/li&gt; 
 &lt;li&gt;Replace the bracketed variables with your account and directory information, then download all the CSV files. &lt;pre&gt;&lt;code class="lang-bash"&gt;	#copy the files from Amazon S3 to local machine
	aws s3 cp s3://blog-sedona-nessie-&amp;lt;account_number&amp;gt;-&amp;lt;aws_region&amp;gt;/visualization/ /&amp;lt;user_directory&amp;gt;/kepler --recursive
	&lt;/code&gt;&lt;/pre&gt; &lt;/li&gt; 
 &lt;li&gt;Extract the files, rename them, and move them to another folder. &lt;pre&gt;&lt;code class="lang-bash"&gt;	# Extract the files processed by Spark and Sedona
	gzip -d ./kepler/kepler_h3_density/*.gz
	gzip -d ./kepler/kepler_track_points_sample/*.gz
	
	# Rename the Spark output files to more readable names
	cd ./kepler/kepler_h3_density/
	ls
	mv part-00000-*.csv kepler_h3_density.csv
	# Return to the directory that contains the kepler and kepler_csv folders
	cd ../..
	
	cd ./kepler/kepler_track_points_sample/
	ls
	mv part-00000-*.csv kepler_track_points_sample.csv
	cd ../..
	
	# Ensure the output folder exists
	mkdir -p ./kepler_csv
	
	# Copy the renamed CSV files to the folder that will be used as input in kepler.gl
	cp ./kepler/kepler_h3_density/*.csv ./kepler_csv
	cp ./kepler/kepler_track_points_sample/*.csv ./kepler_csv
	&lt;/code&gt;&lt;/pre&gt; &lt;/li&gt; 
 &lt;li&gt;Your &lt;strong&gt;kepler_csv&lt;/strong&gt; folder should look similar to the output of the following command. &lt;pre&gt;&lt;code class="lang-bash"&gt;	#list the files in the kepler_csv directory
	ls -l ./kepler_csv
	total 11684
	-rw-rw-r-- 1 ec2-user ec2-user 8630110 Jun 12 14:47 kepler_h3_density.csv
	-rw-rw-r-- 1 ec2-user ec2-user 3331763 Jun 12 14:47 kepler_track_points_sample.csv
	&lt;/code&gt;&lt;/pre&gt; &lt;/li&gt; 
&lt;/ol&gt; 
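&lt;p&gt;If you prefer not to rely on shell tools (for example, on a machine without &lt;code&gt;gzip&lt;/code&gt;), the extract-and-rename step can also be sketched in Python. This is an illustrative alternative, not part of the original walkthrough. The &lt;code&gt;collect_spark_csv&lt;/code&gt; helper is hypothetical, and it assumes each dataset folder contains exactly one &lt;code&gt;part-00000-*.csv.gz&lt;/code&gt; file, as the shell commands above imply. The demo at the bottom runs against a temporary folder with a fake part file so that it does not touch your real download.&lt;/p&gt;

```python
import gzip
import shutil
import tempfile
from pathlib import Path


def collect_spark_csv(download_dir, output_dir, dataset):
    """Decompress the single part file of `dataset` into output_dir/dataset.csv."""
    parts = sorted((Path(download_dir) / dataset).glob("part-00000-*.csv.gz"))
    if len(parts) != 1:
        raise FileNotFoundError(
            f"expected one part file for {dataset}, found {len(parts)}"
        )
    target = Path(output_dir) / f"{dataset}.csv"
    target.parent.mkdir(parents=True, exist_ok=True)
    # Stream the decompressed bytes straight into the renamed CSV file
    with gzip.open(parts[0], "rb") as src, open(target, "wb") as dst:
        shutil.copyfileobj(src, dst)
    return target


# Demo against a temporary folder that mimics the downloaded layout
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "kepler" / "kepler_h3_density"
    source.mkdir(parents=True)
    with gzip.open(source / "part-00000-demo.csv.gz", "wt") as f:
        f.write("point_count,h3_index\n")

    result = collect_spark_csv(
        Path(tmp) / "kepler", Path(tmp) / "kepler_csv", "kepler_h3_density"
    )
    result_name = result.name
    result_text = result.read_text()
    print(result_name, result_text.strip())
```

&lt;p&gt;To use it on the real files, call the helper once for &lt;code&gt;kepler_h3_density&lt;/code&gt; and once for &lt;code&gt;kepler_track_points_sample&lt;/code&gt;, pointing it at your own &lt;strong&gt;kepler&lt;/strong&gt; and &lt;strong&gt;kepler_csv&lt;/strong&gt; folders.&lt;/p&gt;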
&lt;h3&gt;Visualizing the data in a graph&lt;/h3&gt; 
&lt;p&gt;Now that you have saved the data to your local machine, you can analyze the flight data through interactive map graphics. To import the data into the Kepler.gl web visualization tool:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Open the &lt;a href="https://kepler.gl/demo" target="_blank" rel="noopener noreferrer"&gt;Kepler.gl Demo&lt;/a&gt; web application.&lt;/li&gt; 
 &lt;li&gt;Load data into Kepler.gl: 
  &lt;ol type="a"&gt; 
   &lt;li&gt;Choose &lt;strong&gt;Add Data&lt;/strong&gt; in the left panel.&lt;/li&gt; 
   &lt;li&gt;Drag and drop both CSV files (&lt;code&gt;kepler_track_points_sample.csv&lt;/code&gt; and &lt;code&gt;kepler_h3_density.csv&lt;/code&gt;) into the upload area.&lt;/li&gt; 
   &lt;li&gt;Confirm that both datasets are loaded successfully.&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
 &lt;li&gt;Delete all layers.&lt;/li&gt; 
 &lt;li&gt;Create the &lt;strong&gt;Flight Density Layer:&lt;/strong&gt; 
  &lt;ol type="a"&gt; 
   &lt;li&gt;Choose &lt;strong&gt;Add Layer&lt;/strong&gt; in the left panel.&lt;/li&gt; 
   &lt;li&gt;In &lt;strong&gt;Basic&lt;/strong&gt;, choose &lt;strong&gt;H3&lt;/strong&gt; as the layer type, then add the following configuration: 
    &lt;ol type="i"&gt; 
     &lt;li&gt;Layer Name: &lt;strong&gt;Flight Density&lt;/strong&gt;&lt;/li&gt; 
     &lt;li&gt;Data Source: &lt;strong&gt;kepler_h3_density.csv&lt;/strong&gt;&lt;/li&gt; 
     &lt;li&gt;Hex ID: &lt;strong&gt;h3_index&lt;/strong&gt;&lt;/li&gt; 
    &lt;/ol&gt; &lt;/li&gt; 
   &lt;li&gt;In the &lt;strong&gt;Fill Color&lt;/strong&gt; section: 
    &lt;ol type="i"&gt; 
     &lt;li&gt;Color: &lt;strong&gt;point_count&lt;/strong&gt;&lt;/li&gt; 
     &lt;li&gt;Color Scale: &lt;strong&gt;Quantile&lt;/strong&gt;.&lt;/li&gt; 
     &lt;li&gt;Color Range: Choose a blue/green gradient.&lt;/li&gt; 
    &lt;/ol&gt; &lt;/li&gt; 
   &lt;li&gt;Set &lt;strong&gt;Opacity&lt;/strong&gt; to &lt;strong&gt;0.7&lt;/strong&gt;.&lt;/li&gt; 
   &lt;li&gt;In the &lt;strong&gt;Coverage&lt;/strong&gt; section, set it to &lt;strong&gt;0.9&lt;/strong&gt;.&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
 &lt;li&gt;Create the &lt;strong&gt;Flight Tracks Layer:&lt;/strong&gt; 
  &lt;ol type="a"&gt; 
   &lt;li&gt;Choose &lt;strong&gt;Add Layer&lt;/strong&gt; in the left panel.&lt;/li&gt; 
   &lt;li&gt;In &lt;strong&gt;Basic&lt;/strong&gt;, choose &lt;strong&gt;Point&lt;/strong&gt; as the layer type, then add the following configuration: 
    &lt;ol type="i"&gt; 
     &lt;li&gt;Layer Name: &lt;strong&gt;Flight Tracks&lt;/strong&gt;&lt;/li&gt; 
     &lt;li&gt;Data Source: &lt;strong&gt;kepler_track_points_sample.csv&lt;/strong&gt;&lt;/li&gt; 
     &lt;li&gt;Columns: 
      &lt;ol&gt; 
       &lt;li&gt;Latitude: &lt;strong&gt;lat&lt;/strong&gt;&lt;/li&gt; 
       &lt;li&gt;Longitude: &lt;strong&gt;lon&lt;/strong&gt;&lt;/li&gt; 
      &lt;/ol&gt; &lt;/li&gt; 
    &lt;/ol&gt; &lt;/li&gt; 
   &lt;li&gt;In the &lt;strong&gt;Fill Color &lt;/strong&gt;section: 
    &lt;ol type="i"&gt; 
     &lt;li&gt;Solid Color: &lt;strong&gt;Orange&lt;/strong&gt;&lt;/li&gt; 
     &lt;li&gt;Opacity: &lt;strong&gt;0.3&lt;/strong&gt;&lt;/li&gt; 
    &lt;/ol&gt; &lt;/li&gt; 
   &lt;li&gt;Set the point &lt;strong&gt;Radius&lt;/strong&gt; to &lt;strong&gt;1&lt;/strong&gt;.&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
 &lt;li&gt;The layers should look similar to the following figure.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90108" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/10/BDB-5249-geospatial-11.jpg" alt="Kepler.gl layer configuration panel for Flight Density H3 layer using kepler_h3_density.csv data source." width="998" height="1051"&gt;&lt;/p&gt; 
&lt;ol start="7"&gt; 
 &lt;li&gt;The graph visualization should now show flight density through color-coded hexagons, with individual flight tracks visible as orange points:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90109" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/10/BDB-5249-geospatial-12.jpg" alt="Kepler.gl interactive map visualization displaying global flight density heatmap. High-density areas shown in yellow over North America, particularly the United States." width="1897" height="924"&gt;&lt;/p&gt; 
&lt;p&gt;There you go! Now that you have knowledge about geospatial data and have created your first use case, take the opportunity to do some analysis and learn some interesting facts about flight patterns.&lt;/p&gt; 
&lt;p&gt;You can also experiment with other types of analysis in Kepler.gl, such as &lt;a href="https://docs.kepler.gl/docs/user-guides/h-playback" target="_blank" rel="noopener noreferrer"&gt;Time Playback&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;Clean up&lt;/h2&gt; 
&lt;p&gt;To clean up your resources, complete the following tasks:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Delete the AWS Glue job &lt;code&gt;process_sedona_geo_track&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/empty-bucket.html" target="_blank" rel="noopener noreferrer"&gt;Delete content&lt;/a&gt; from the Amazon S3 buckets: &lt;code&gt;blog-sedona-artifacts-&amp;lt;account_number&amp;gt;-&amp;lt;aws_region&amp;gt;&lt;/code&gt; and &lt;code&gt;blog-sedona-nessie-&amp;lt;account_number&amp;gt;-&amp;lt;aws_region&amp;gt;&lt;/code&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, we showed how processing geospatial data can present significant challenges because of its complexity, from data volume to data structure and format. The flight tracker use case involves vast amounts of information across multiple dimensions such as time, location, altitude, and flight paths. However, the combination of Spark’s distributed computing capabilities and Sedona’s optimized geospatial functions helps overcome those challenges. The spatial partitioning and indexing features of Sedona, coupled with Spark’s framework, enable us to perform complex spatial joins and proximity analyses efficiently, simplifying the overall data processing workflow.&lt;/p&gt; 
&lt;p&gt;The serverless nature of AWS Glue eliminates the need for managing infrastructure while automatically scaling resources based on workload demands, making it an ideal platform for processing growing volumes of flight data. As the volume of flight data grows or as processing requirements fluctuate, with AWS Glue, you can quickly adjust resources to meet demand, ensuring optimal performance without the need for cluster management.&lt;/p&gt; 
&lt;p&gt;By converting the processed results into CSV format and visualizing them in Kepler.gl, you can create interactive visualizations that reveal patterns in flight paths and efficiently analyze air traffic, routes, and other insights. This end-to-end solution demonstrates how a modern data strategy in AWS with the support of open-source tools can transform raw geospatial data into actionable insights.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90123" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/13/ruanroloff.jpeg" alt="Ruan" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Ruan Roloff&lt;/strong&gt; is a Lead GTM Specialist Architect for Analytics and AI at AWS. During his time at AWS, he was responsible for the data journey and AI product strategy of customers across a range of industries, including finance, oil and gas, manufacturing, digital natives, public sector, and startups. He has helped these organizations achieve multi-million dollar use cases. Outside of work, Ruan likes to assemble and disassemble things, fish on the beach with friends, play SFII, and go hiking in the woods with his family.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90122" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/13/lucasvitoreti.jpeg" alt="Lucas" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Lucas Vitoreti&lt;/strong&gt; is a ProServe Data &amp;amp; Analytics Specialist at AWS with 12+ years in the data domain. He architects and delivers solutions for data warehouses, lakes, lakehouses, and meshes, helping organizations transform their data strategies and achieve business outcomes. His expertise includes scalable data architectures and guiding data-driven transformations. He balances professional life with weightlifting, music, and family time.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90121" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/13/denysgonzaga.jpeg" alt="Denys" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Denys Gonzaga&lt;/strong&gt; is a ProServe Consultant at AWS. He is an experienced professional with over 15 years of experience working across multiple technical domains, with a strong focus on development and data analytics. Throughout his career, he has successfully applied his skills in various industries, including aerospace, finance, telecommunications, and retail. Outside of AWS, Denys enjoys spending time with his family and playing video games.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Analyzing your data catalog: Query SageMaker Catalog metadata with SQL</title>
		<link>https://aws.amazon.com/blogs/big-data/analyzing-your-data-catalog-query-sagemaker-catalog-metadata-with-sql/</link>
					
		
		<dc:creator><![CDATA[Ramesh H Singh]]></dc:creator>
		<pubDate>Wed, 22 Apr 2026 15:37:38 +0000</pubDate>
				<category><![CDATA[Amazon SageMaker Data & AI Governance]]></category>
		<category><![CDATA[Amazon SageMaker Unified Studio]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">6de67b4491648f7f2b2aa026de7adf83a8026c80</guid>

					<description>In this post, we demonstrate how to use the metadata export capability in Amazon SageMaker Catalog and perform analytics such as historical changes, monitor asset growth and track metadata improvements.</description>
										<content:encoded>&lt;p&gt;As your data and machine learning (ML) assets grow, tracking which assets lack documentation or monitoring asset registration trends becomes challenging without custom reporting infrastructure. You need visibility into your catalog’s health, without the overhead of managing ETL jobs. The metadata feature of &lt;a href="https://aws.amazon.com/sagemaker/" target="_blank" rel="noopener noreferrer"&gt;Amazon SageMaker&lt;/a&gt; provides this capability to users. Converting catalog asset metadata into &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/introduction.html" target="_blank" rel="noopener noreferrer"&gt;Apache Iceberg&lt;/a&gt; tables stored in &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables.html" target="_blank" rel="noopener noreferrer"&gt;Amazon S3 Tables&lt;/a&gt; removes the need to build and maintain custom ETL pipelines. Your team can then query asset metadata directly using standard SQL tools. You can now answer governance questions like asset registration trends, classification status, and metadata completeness using standard SQL queries through tools like &lt;a href="https://aws.amazon.com/athena/" target="_blank" rel="noopener noreferrer"&gt;Amazon Athena&lt;/a&gt;, &lt;a href="https://aws.amazon.com/sagemaker/unified-studio/" target="_blank" rel="noopener noreferrer"&gt;Amazon SageMaker Unified Studio&lt;/a&gt; notebooks, and BI systems.&lt;/p&gt; 
&lt;p&gt;This automated approach reduces ETL development time and gives your team visibility into catalog health, compliance gaps, and asset lifecycle patterns. The exported tables include technical metadata, business metadata, project ownership details, and timestamps, partitioned by snapshot date to enable time travel queries and historical analysis. Teams can use this capability to proactively monitor catalog health, identify gaps in documentation, track asset lifecycle patterns, and make sure that governance policies are consistently applied.&lt;/p&gt; 
&lt;h2&gt;How metadata export works&lt;/h2&gt; 
&lt;p&gt;After you enable the metadata export feature, it runs automatically on a daily schedule:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;SageMaker Catalog creates the infrastructure&lt;/strong&gt; — An Amazon Simple Storage Service (Amazon S3) table bucket named &lt;code&gt;aws-sagemaker-catalog&lt;/code&gt; is created with an &lt;code&gt;asset_metadata&lt;/code&gt; namespace and an empty asset table.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Daily snapshots are captured&lt;/strong&gt; — A scheduled job runs once per day around midnight (local time per AWS Region) to export updated asset metadata.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Metadata is structured and partitioned&lt;/strong&gt; — The export captures technical metadata (resource_id, resource_type), business metadata (asset_name, business_description), project ownership details, and timestamps, partitioned by &lt;code&gt;snapshot_date&lt;/code&gt; for query performance.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Data becomes queryable&lt;/strong&gt; — Within 24 hours, the asset table appears in Amazon SageMaker Unified Studio under the &lt;code&gt;aws-sagemaker-catalog&lt;/code&gt; bucket and becomes accessible through Amazon Athena, Studio notebooks, or external BI tools.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Teams query using standard SQL&lt;/strong&gt; — Data teams can now answer questions like “How many assets were registered last month?” or “Which assets lack business descriptions?” without building custom ETL pipelines.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;The export evaluates catalog assets and their metadata properties in the domain, converting them into Apache Iceberg table format. The data flows into downstream analytics operations immediately, with no separate ETL or batch processes to maintain. The exported metadata becomes part of a queryable data lake that supports time-travel queries and historical analysis.&lt;/p&gt; 
&lt;p&gt;In this post, we demonstrate how to use the metadata export capability in Amazon SageMaker Catalog and perform analytics on these tables. We explore the following use cases:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Audit historical changes to investigate what an asset looked like at a specific point in time.&lt;/li&gt; 
 &lt;li&gt;Monitor asset growth to view how the data catalog has grown over the last 30 days.&lt;/li&gt; 
 &lt;li&gt;Track metadata improvements to see which assets gained descriptions or ownership over time.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;div id="attachment_90416" style="width: 1431px" class="wp-caption alignleft"&gt;
 &lt;img aria-describedby="caption-attachment-90416" loading="lazy" class="wp-image-90416 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5843-image-1.jpeg" alt="AWS Cloud architecture diagram showing data pipeline from Amazon SageMaker Catalog to Amazon S3 Tables with daily export, connecting to query engines including Amazon Athena, Amazon Redshift, and Apache Spark" width="1421" height="801"&gt;
 &lt;p id="caption-attachment-90416" class="wp-caption-text"&gt;Figure 1 – SageMaker catalog export to S3 Tables&lt;/p&gt;
&lt;/div&gt; 
&lt;p&gt;The architecture consists of three key components:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Amazon SageMaker Catalog exports asset metadata daily to Amazon S3.&lt;/li&gt; 
 &lt;li&gt;S3 Tables stores metadata as Apache Iceberg tables in the &lt;code&gt;aws-sagemaker-catalog&lt;/code&gt; bucket with ACID compliance and time travel.&lt;/li&gt; 
 &lt;li&gt;Query engines (Amazon Athena, &lt;a href="https://aws.amazon.com/pm/redshift/" target="_blank" rel="noopener noreferrer"&gt;Amazon Redshift&lt;/a&gt;, and Apache Spark) access metadata using standard SQL from the &lt;code&gt;asset_metadata.asset&lt;/code&gt; table.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h3&gt;What metadata is exposed?&lt;/h3&gt; 
&lt;p&gt;SageMaker Catalog exports metadata in the &lt;code&gt;asset_metadata.asset&lt;/code&gt; table:&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Metadata Type&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Fields&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Description&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Technical metadata&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;code&gt;resource_id&lt;/code&gt;, &lt;code&gt;resource_type_enum&lt;/code&gt;, &lt;code&gt;account_id&lt;/code&gt;, &lt;code&gt;region&lt;/code&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Resource identifiers (ARN), types (&lt;code&gt;GlueTable&lt;/code&gt;, &lt;code&gt;RedshiftTable&lt;/code&gt;, &lt;code&gt;S3Collection&lt;/code&gt;), and location&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Namespace hierarchy&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;code&gt;catalog&lt;/code&gt;, &lt;code&gt;namespace&lt;/code&gt;, &lt;code&gt;resource_name&lt;/code&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Organizational structure for assets&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Business metadata&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;code&gt;asset_name&lt;/code&gt;, &lt;code&gt;business_description&lt;/code&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Human-readable names and descriptions&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Ownership&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;code&gt;extended_metadata['owningEntityId']&lt;/code&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Asset ownership information&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Timestamps&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;code&gt;asset_created_time&lt;/code&gt;, &lt;code&gt;asset_updated_time&lt;/code&gt;, &lt;code&gt;snapshot_time&lt;/code&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Asset creation time, last update time, and the time of the export snapshot&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Custom metadata&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;code&gt;extended_metadata['form-name.field-name']&lt;/code&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;User-defined metadata forms as key-value pairs&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;The &lt;code&gt;snapshot_time&lt;/code&gt; column supports point-in-time analysis and query of historical catalog states.&lt;/p&gt; 
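&lt;p&gt;After the export is enabled and the first snapshot is available (setup is covered in the following sections), you can combine these fields in ordinary SQL. The following query is a minimal sketch rather than a reference implementation: it counts assets by &lt;code&gt;resource_type_enum&lt;/code&gt; for the most recent snapshot date. It assumes that filtering on the latest &lt;code&gt;DATE(snapshot_time)&lt;/code&gt; isolates a single export run, and that your query engine supports scalar subqueries.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;-- Sketch: asset inventory by resource type for the latest snapshot date.
-- Assumption: one export run per calendar date, so the latest date
-- identifies the latest snapshot.
SELECT
    resource_type_enum,
    COUNT(*) AS asset_count
FROM asset_metadata.asset
WHERE DATE(snapshot_time) = (
    SELECT MAX(DATE(snapshot_time)) FROM asset_metadata.asset
)
GROUP BY resource_type_enum
ORDER BY asset_count DESC;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 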
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;To follow along with this post, you must have the following:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;An &lt;a href="https://aws.amazon.com/sagemaker/unified-studio/" target="_blank" rel="noopener noreferrer"&gt;Amazon SageMaker Unified Studio&lt;/a&gt; domain set up with domain owner or domain unit owner permissions. 
  &lt;ul&gt; 
   &lt;li&gt;A SageMaker Unified Studio domain identifier&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener noreferrer"&gt;AWS Identity and Access Management (IAM)&lt;/a&gt; permissions for configuring metadata export.&lt;/li&gt; 
 &lt;li&gt;Grant catalog, database, and table Select and Describe permissions with &lt;a href="https://aws.amazon.com/lake-formation/" target="_blank" rel="noopener noreferrer"&gt;AWS Lake Formation&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/getting-started-install.html" target="_blank" rel="noopener noreferrer"&gt;AWS Command Line Interface (AWS CLI)&lt;/a&gt; version 2.33.0 or later installed and configured&lt;/li&gt; 
 &lt;li&gt;An Amazon SageMaker project for publishing assets.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;For SageMaker Unified Studio domain setup instructions, refer to the SageMaker Unified Studio &lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/getting-started.html" target="_blank" rel="noopener noreferrer"&gt;Getting started&lt;/a&gt; guide.&lt;/p&gt; 
&lt;p&gt;After you complete the prerequisites, complete the following steps.&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Add this policy to your IAM user or role to enable metadata export. If you use SageMaker Unified Studio to query the catalog, add this policy to the &lt;code&gt;AmazonSageMakerAdminIAMExecutionRole&lt;/code&gt; managed role.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;pre&gt;&lt;code class="lang-json"&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "datazone:GetDataExportConfiguration",
        "datazone:PutDataExportConfiguration"
      ],
      "Resource": "*"
    },
    {
      "Effect": "Allow",
      "Action": [
        "s3tables:CreateTableBucket",
        "s3tables:PutTableBucketPolicy"
      ],
      "Resource": "arn:aws:s3tables:*:*:bucket/aws-sagemaker-catalog"
    }
  ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;Grant &lt;strong&gt;Describe&lt;/strong&gt; and &lt;strong&gt;Select&lt;/strong&gt; permissions for SageMaker Catalog with AWS Lake Formation. You can perform this step in the AWS Lake Formation console. 
  &lt;ol type="a"&gt; 
   &lt;li&gt;Select &lt;strong&gt;Permissions&lt;/strong&gt; -&amp;gt; &lt;strong&gt;Data permissions&lt;/strong&gt; and choose &lt;strong&gt;Grant&lt;/strong&gt;.&lt;p&gt;&lt;/p&gt;
    &lt;div id="attachment_90415" style="width: 1435px" class="wp-caption alignnone"&gt;
     &lt;img aria-describedby="caption-attachment-90415" loading="lazy" class="size-full wp-image-90415" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5843-image-2.jpeg" alt="AWS Lake Formation Grant Permissions interface showing principal type selection with IAM users and roles option selected and AmazonSageMakerAdminIAMExecutionRole assigned" width="1425" height="878"&gt;
     &lt;p id="caption-attachment-90415" class="wp-caption-text"&gt;Figure 2 – AWS Lake Formation grant permission&lt;/p&gt;
    &lt;/div&gt;&lt;/li&gt; 
   &lt;li&gt;Under &lt;strong&gt;Principal type&lt;/strong&gt;, select &lt;strong&gt;Principals&lt;/strong&gt;, choose &lt;strong&gt;IAM users and roles&lt;/strong&gt;, and then select the &lt;strong&gt;AmazonSageMakerAdminIAMExecutionRole&lt;/strong&gt; managed role.&lt;/li&gt; 
   &lt;li&gt;Choose &lt;strong&gt;Named Data Catalog resources&lt;/strong&gt;.&lt;/li&gt; 
   &lt;li&gt;Under &lt;strong&gt;Catalogs&lt;/strong&gt;, search for and select &lt;strong&gt;&amp;lt;account-id&amp;gt;:s3tablescatalog/aws-sagemaker-catalog&lt;/strong&gt;.&lt;/li&gt; 
   &lt;li&gt;Under &lt;strong&gt;Databases&lt;/strong&gt;, select the &lt;strong&gt;asset_metadata&lt;/strong&gt; database. 
    &lt;div id="attachment_90414" style="width: 1439px" class="wp-caption alignnone"&gt;
     &lt;img aria-describedby="caption-attachment-90414" loading="lazy" class="size-full wp-image-90414" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5843-image-3.jpeg" alt="AWS Lake Formation Grant Permissions page showing Named Data Catalog resources method with s3tablescatalog/aws-sagemaker-catalog selected, asset_metadata database, and asset table configured" width="1429" height="1073"&gt;
     &lt;p id="caption-attachment-90414" class="wp-caption-text"&gt;Figure 3 – AWS Lake Formation catalog, database, and table&lt;/p&gt;
    &lt;/div&gt; &lt;p&gt;&lt;/p&gt;
    &lt;div id="attachment_90413" style="width: 1438px" class="wp-caption alignnone"&gt;
     &lt;img aria-describedby="caption-attachment-90413" loading="lazy" class="size-full wp-image-90413" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5843-image-4.jpeg" alt="AWS Lake Formation Grant Permissions interface showing table permissions with Select and Describe checked, grantable permissions section, and All data access radio button selected" width="1428" height="1247"&gt;
     &lt;p id="caption-attachment-90413" class="wp-caption-text"&gt;Figure 4 – AWS Lake Formation grant permission&lt;/p&gt;
    &lt;/div&gt;&lt;/li&gt; 
   &lt;li&gt;For &lt;strong&gt;Table&lt;/strong&gt;, select &lt;strong&gt;asset&lt;/strong&gt;.&lt;/li&gt; 
   &lt;li&gt;Under &lt;strong&gt;Table permissions&lt;/strong&gt;, check &lt;strong&gt;Select&lt;/strong&gt; and &lt;strong&gt;Describe.&lt;/strong&gt;&lt;/li&gt; 
   &lt;li&gt;Choose &lt;strong&gt;Grant&lt;/strong&gt; to save the permissions.&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;h3&gt;Enable data export using the AWS CLI&lt;/h3&gt; 
&lt;p&gt;Configure metadata export using the &lt;code&gt;PutDataExportConfiguration&lt;/code&gt; API. The &lt;a href="https://aws.amazon.com/datazone/" target="_blank" rel="noopener noreferrer"&gt;Amazon DataZone&lt;/a&gt; service automatically creates an S3 table bucket named &lt;code&gt;aws-sagemaker-catalog&lt;/code&gt; with an &lt;code&gt;asset_metadata&lt;/code&gt; namespace, and schedules a daily export job.&amp;nbsp;Asset metadata is exported once daily around midnight local time per AWS Region.&lt;/p&gt; 
&lt;p&gt;The SageMaker domain identifier is available on the domain detail page in the &lt;a href="https://aws.amazon.com/console/" target="_blank" rel="noopener noreferrer"&gt;AWS Management Console&lt;/a&gt;. After you enable the export, it can take up to 24 hours before the asset table is accessible through the S3 Tables console or the Data tab in SageMaker Unified Studio.&lt;/p&gt; 
&lt;p&gt;Use this AWS CLI command to enable SageMaker Catalog export:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;aws datazone put-data-export-configuration --domain-identifier &amp;lt;domain-id&amp;gt; --region &amp;lt;region&amp;gt; --enable-export&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Use this AWS CLI command to validate that the export is enabled:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-css"&gt;aws datazone get-data-export-configuration --domain-identifier &amp;lt;domain-id&amp;gt; --region &amp;lt;region&amp;gt;
{
    "isExportEnabled": true,
    "status": "COMPLETED",
    "s3TableBucketArn": "arn:aws:s3tables:&amp;lt;region&amp;gt;:&amp;lt;account-id&amp;gt;:bucket/aws-sagemaker-catalog",
    "createdAt": "2025-11-26T18:24:02.150000+00:00",
    "updatedAt": "2026-02-23T19:33:40.987000+00:00"
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3&gt;Access the exported asset table&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Navigate to Amazon SageMaker &lt;strong&gt;Domains&lt;/strong&gt; in the AWS Management Console.&lt;/li&gt; 
 &lt;li&gt;Select your domain and select &lt;strong&gt;Open&lt;/strong&gt;. &lt;p&gt;&lt;/p&gt;
  &lt;div id="attachment_90412" style="width: 1440px" class="wp-caption alignnone"&gt;
   &lt;img aria-describedby="caption-attachment-90412" loading="lazy" class="size-full wp-image-90412" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5843-image-5.jpeg" alt="Amazon SageMaker Domains management page showing an Identity Center based domain with Available status, created February 26, 2026, with Open unified studio button highlighted" width="1430" height="313"&gt;
   &lt;p id="caption-attachment-90412" class="wp-caption-text"&gt;Figure 5 – Open Amazon SageMaker Unified Studio&lt;/p&gt;
  &lt;/div&gt;&lt;/li&gt; 
 &lt;li&gt;In SageMaker Unified Studio, choose a project from the &lt;strong&gt;Select a project&lt;/strong&gt; dropdown list.&lt;/li&gt; 
 &lt;li&gt;To query SageMaker catalog data, select &lt;strong&gt;Build&lt;/strong&gt; in the menu bar and then choose &lt;strong&gt;Query Editor&lt;/strong&gt;. To create a new project, follow the instructions in the &lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/getting-started-create-a-project.html" target="_blank" rel="noopener noreferrer"&gt;Amazon SageMaker Unified Studio User Guide&lt;/a&gt;. &lt;p&gt;&lt;/p&gt;
  &lt;div id="attachment_90411" style="width: 1439px" class="wp-caption alignnone"&gt;
   &lt;img aria-describedby="caption-attachment-90411" loading="lazy" class="size-full wp-image-90411" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5843-image-6.jpeg" alt="SageMaker Unified Studio project overview dashboard showing IDE and Applications, Data Analysis and Integration with Query Editor highlighted, Orchestration, and Machine Learning and Generative AI categories" width="1429" height="620"&gt;
   &lt;p id="caption-attachment-90411" class="wp-caption-text"&gt;Figure 6 – Open SageMaker Unified Studio Query Editor&lt;/p&gt;
  &lt;/div&gt;&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;The&amp;nbsp;&lt;code&gt;asset_metadata.asset&lt;/code&gt;&amp;nbsp;table is available in Data explorer. Use &lt;strong&gt;Data explorer&lt;/strong&gt; to view the schema and query the data for analytics.&lt;/p&gt; 
&lt;ol start="5"&gt; 
 &lt;li&gt;Expand &lt;strong&gt;Catalogs&lt;/strong&gt; in Data explorer. Then, select and expand &lt;strong&gt;s3tablescatalog&lt;/strong&gt;, &lt;strong&gt;aws-sagemaker-catalog&lt;/strong&gt;, &lt;strong&gt;asset_metadata&lt;/strong&gt;, and &lt;strong&gt;asset&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Test querying the catalog with &lt;code&gt;SELECT * FROM asset_metadata.asset LIMIT 10;&lt;/code&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div id="attachment_90410" style="width: 1439px" class="wp-caption alignleft"&gt;
 &lt;img aria-describedby="caption-attachment-90410" loading="lazy" class="wp-image-90410 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5843-image-7.jpeg" alt="SageMaker Unified Studio Query Editor with Data Explorer showing Lakehouse hierarchy including s3tablescatalog, aws-sagemaker-catalog, asset_metadata database, and asset table schema with SQL SELECT query" width="1429" height="731"&gt;
 &lt;p id="caption-attachment-90410" class="wp-caption-text"&gt;Figure 7 – Query SageMaker catalog&lt;/p&gt;
&lt;/div&gt; 
&lt;h2&gt;Queries for observability and analytics&lt;/h2&gt; 
&lt;p&gt;With setup complete, run queries to gain insights into catalog usage and changes. To monitor asset growth, use the following query to view how the data catalog has grown over the last five days:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;SELECT
    DATE(snapshot_time) AS date,
    COUNT(*) AS total_assets
FROM asset_metadata.asset
WHERE
    DATE(snapshot_time) &amp;gt;= CURRENT_DATE - INTERVAL '5' DAY
GROUP BY DATE(snapshot_time)
ORDER BY date DESC;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;div id="attachment_90409" style="width: 1439px" class="wp-caption alignleft"&gt;
 &lt;img aria-describedby="caption-attachment-90409" loading="lazy" class="wp-image-90409 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5843-image-8.jpeg" alt="SageMaker Unified Studio Query Editor showing SQL aggregation query on asset_metadata.asset table with results displaying date and total_assets columns, returning 42 assets for March 7-8, 2026&amp;quot;" width="1429" height="730"&gt;
 &lt;p id="caption-attachment-90409" class="wp-caption-text"&gt;Figure 8 – Query asset growth&lt;/p&gt;
&lt;/div&gt; 
&lt;p&gt;You can also track metadata changes to determine which assets gained descriptions or ownership over time. The following query identifies assets that gained business descriptions over the past five days by comparing today’s snapshot with the snapshot from five days earlier.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;SELECT
    t.asset_id,
    t.resource_name,
    p.business_description AS description_before,
    t.business_description AS description_now
FROM asset_metadata.asset t
JOIN asset_metadata.asset p ON t.asset_id = p.asset_id
WHERE DATE(t.snapshot_time) = CURRENT_DATE
    AND DATE(p.snapshot_time) = CURRENT_DATE - INTERVAL '5' DAY
    AND p.business_description IS NULL
    AND t.business_description IS NOT NULL;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
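&lt;p&gt;The same self-join pattern can be adapted to ownership. The following sketch lists assets that had no owner five days ago but have one today. It assumes that ownership is read from &lt;code&gt;extended_metadata['owningEntityId']&lt;/code&gt;, as listed in the metadata table earlier in this post, and that a missing owner surfaces as NULL. If your query engine raises an error when a map key is absent, you might need a NULL-safe lookup function in place of the subscript.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;-- Sketch: assets that gained an owner between the snapshot from
-- five days ago (p) and today's snapshot (t).
-- Assumption: an unowned asset has a NULL owningEntityId value.
SELECT
    t.asset_id,
    t.resource_name,
    t.extended_metadata['owningEntityId'] AS owner_now
FROM asset_metadata.asset t
JOIN asset_metadata.asset p ON t.asset_id = p.asset_id
WHERE DATE(t.snapshot_time) = CURRENT_DATE
    AND DATE(p.snapshot_time) = CURRENT_DATE - INTERVAL '5' DAY
    AND p.extended_metadata['owningEntityId'] IS NULL
    AND t.extended_metadata['owningEntityId'] IS NOT NULL;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 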
&lt;p&gt;Investigate asset values at a specific point in time using this query to retrieve metadata from any snapshot date.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;SELECT
    asset_id,
    resource_name,
    business_description,
    extended_metadata['owningEntityId'] AS owner,
    snapshot_time
FROM asset_metadata.asset
WHERE asset_id = 'your-asset-id'
    AND DATE(snapshot_time) = DATE('2025-11-26');&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
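&lt;p&gt;To track metadata completeness over time, you can aggregate across snapshots instead of inspecting a single asset. The following query is a minimal sketch: for each snapshot date, it counts all assets and the assets that have a business description. It assumes that a missing description is stored as NULL (the same assumption as the earlier description query), so an empty string would still count as described.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;-- Sketch: metadata completeness per snapshot date.
-- COUNT(column) counts only rows where the column is not NULL.
SELECT
    DATE(snapshot_time) AS snapshot_date,
    COUNT(*) AS total_assets,
    COUNT(business_description) AS described_assets,
    COUNT(*) - COUNT(business_description) AS undescribed_assets
FROM asset_metadata.asset
GROUP BY DATE(snapshot_time)
ORDER BY snapshot_date DESC;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;A rising &lt;code&gt;undescribed_assets&lt;/code&gt; count suggests that new assets are being published faster than they are documented.&lt;/p&gt; 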
&lt;h2&gt;Clean up resources&lt;/h2&gt; 
&lt;p&gt;To avoid ongoing charges, clean up the resources created in this walkthrough:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Disable metadata export:&lt;/strong&gt;&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Disable the daily metadata export to stop new snapshots:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-javascript"&gt;aws datazone put-data-export-configuration \
  --domain-identifier &amp;lt;domain-id&amp;gt; \
  --no-enable-export \
  --region &amp;lt;region&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;&lt;strong&gt;Delete S3 Tables resources:&lt;/strong&gt;&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Optionally, delete the S3 Tables namespace containing the exported metadata to remove historical snapshots and stop storage charges. For instructions on how to delete S3 tables, see &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-delete.html" target="_blank" rel="noopener noreferrer"&gt;Deleting an Amazon S3 table&lt;/a&gt; in the Amazon Simple Storage Service User Guide.&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, you enabled the metadata export feature of SageMaker Catalog and used SQL queries to gain visibility into your asset inventory. The feature converts asset metadata into Apache Iceberg tables partitioned by snapshot date, so you can perform time-travel queries, monitor catalog growth, track metadata completeness, and audit historical asset states. This provides a repeatable, low-overhead way to maintain catalog health and meet governance requirements over time.&lt;/p&gt; 
&lt;p&gt;To learn more about Amazon SageMaker Catalog, see the&amp;nbsp;&lt;a href="https://aws.amazon.com/sagemaker/catalog/" target="_blank" rel="noopener noreferrer"&gt;Amazon SageMaker Catalog documentation&lt;/a&gt;. To explore Apache Iceberg table formats and time-travel queries, see the&amp;nbsp;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables.html" target="_blank" rel="noopener noreferrer"&gt;Amazon S3 Tables documentation&lt;/a&gt;.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h3&gt;About the Authors&lt;/h3&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-full wp-image-90408" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5843-image-9.png" alt="Photo of Author Ramesh Singh" width="100" height="134"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;a href="http://www.linkedin.com/in/ramesh-harisaran-singh" target="_blank" rel="noopener noreferrer"&gt;Ramesh&lt;/a&gt;&amp;nbsp;is a Senior Product Manager Technical (External Services) at AWS in Seattle, Washington, currently with the Amazon SageMaker team. He is passionate about building high-performance ML/AI and analytics products that help enterprise customers achieve their critical goals using cutting-edge technology.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-full wp-image-90407" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5843-image-10.png" alt="Photo of Author Pradeep Misra" width="100" height="130"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/pradeep-m-326258a/" target="_blank" rel="noopener noreferrer"&gt;Pradeep&lt;/a&gt;&amp;nbsp;is a Principal Analytics and Applied AI Solutions Architect at AWS. He is passionate about solving customer challenges using data, analytics, and Applied AI. Outside of work, he likes exploring new places and playing badminton with his family. He also likes doing science experiments, building LEGOs, and watching anime with his daughters.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-full wp-image-90406" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5843-image-11.png" alt="Photo of Author - Rohith Kayathi" width="190" height="203"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/rohith-kayathi/" target="_blank" rel="noopener noreferrer"&gt;Rohith&lt;/a&gt; is a Senior Software Engineer at Amazon Web Services (AWS) working with the Amazon SageMaker team. He leads business data catalog, generative AI–powered metadata curation, and lineage solutions. He is passionate about building large-scale distributed systems, solving complex problems, and setting the bar for engineering excellence for his team.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft size-full wp-image-90405" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5843-image-12.jpeg" alt="Photo of Author - Steve Phillips" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/stevephillipsca" target="_blank" rel="noopener noreferrer"&gt;Steve&lt;/a&gt; is a Principal Technical Account Manager and Analytics specialist at AWS in the North America region. Steve currently focuses on data warehouse architectural design, data lakes, data ingestion pipelines, and cloud distributed architectures.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Configure a custom domain name for your Amazon MSK cluster enabled with IAM authentication</title>
		<link>https://aws.amazon.com/blogs/big-data/configure-a-custom-domain-name-for-your-amazon-msk-cluster-enabled-with-iam-authentication/</link>
					
		
		<dc:creator><![CDATA[Mazrim Mehrtens]]></dc:creator>
		<pubDate>Tue, 21 Apr 2026 16:33:29 +0000</pubDate>
				<category><![CDATA[Amazon Managed Streaming for Apache Kafka (Amazon MSK)]]></category>
		<category><![CDATA[Expert (400)]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">b127855f1f99dc9a8ba267ae5be51e65c2868798</guid>

					<description>In the first part of Configure a custom domain name for your Amazon MSK cluster, we discussed why custom domain names are important and provided details on how to configure a custom domain name in Amazon MSK when using SASL_SCRAM authentication. In this post, we discuss how to configure a custom domain name in Amazon MSK when using IAM authentication.</description>
										<content:encoded>&lt;p&gt;Most &lt;a href="https://aws.amazon.com/msk/" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Streaming for Apache Kafka&lt;/a&gt; (Amazon MSK) customers are simplifying and standardizing access control to Kafka resources using&amp;nbsp;&lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener noreferrer"&gt;AWS Identity and Access Management&lt;/a&gt;&amp;nbsp;(IAM) authentication. This adoption is also accelerated as &lt;a href="https://aws.amazon.com/blogs/big-data/amazon-msk-iam-authentication-now-supports-all-programming-languages/" target="_blank" rel="noopener noreferrer"&gt;Amazon MSK now supports IAM authentication in popular languages&lt;/a&gt; including&amp;nbsp;&lt;a href="https://github.com/aws/aws-msk-iam-auth" target="_blank" rel="noopener noreferrer"&gt;Java&lt;/a&gt;, &lt;a href="https://github.com/aws/aws-msk-iam-sasl-signer-python" target="_blank" rel="noopener noreferrer"&gt;Python&lt;/a&gt;, &lt;a href="https://github.com/aws/aws-msk-iam-sasl-signer-go" target="_blank" rel="noopener noreferrer"&gt;Go&lt;/a&gt;, &lt;a href="https://github.com/aws/aws-msk-iam-sasl-signer-js" target="_blank" rel="noopener noreferrer"&gt;JavaScript&lt;/a&gt;, and &lt;a href="https://github.com/aws/aws-msk-iam-sasl-signer-net" target="_blank" rel="noopener noreferrer"&gt;.NET&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;In the first part of &lt;a href="https://aws.amazon.com/blogs/big-data/configure-a-custom-domain-name-for-your-amazon-msk-cluster/" target="_blank" rel="noopener noreferrer"&gt;Configure a custom domain name for your Amazon MSK cluster&lt;/a&gt;, we discussed why custom domain names are important and provided details on how to configure a custom domain name in Amazon MSK when using SASL_SCRAM authentication. In this post, we discuss how to configure a custom domain name in Amazon MSK when using IAM authentication. We recommend that you read the first part of this series because it captures solution details and implementation steps.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;IAM authentication for Amazon MSK uses TLS to encrypt the Kafka protocol traffic between the client and Kafka broker. To use a custom domain name, the Kafka broker needs to present a server certificate that matches the custom domain name. To achieve this, this solution uses a&amp;nbsp;&lt;a href="https://aws.amazon.com/elasticloadbalancing/network-load-balancer/" target="_blank" rel="noopener noreferrer"&gt;Network Load Balancer (NLB)&lt;/a&gt; with AWS Certificate Manager to provide a custom certificate on behalf of the MSK brokers, and a Route 53 private hosted zone to provide DNS for the custom domain name.&lt;/p&gt; 
&lt;p&gt;The following diagram shows all components used by the solution.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="size-full wp-image-89068 aligncenter" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/03/18/image-1-2.jpg" alt="Architecture showing configuration of custom domain name with Amazon MSK" width="571" height="536"&gt;&lt;/p&gt; 
&lt;h3&gt;Certificate management&lt;/h3&gt; 
&lt;p&gt;For clients to perform TLS communication with the MSK cluster, the cluster needs to provide a certificate with hostnames matching the custom domain name. This solution uses a certificate in&amp;nbsp;&lt;a href="https://aws.amazon.com/certificate-manager/" target="_blank" rel="noopener noreferrer"&gt;AWS Certificate Manager&lt;/a&gt; (ACM) signed with a Private Certificate Authority (PCA) for TLS with the custom domain name. The certificate uses&amp;nbsp;&lt;code&gt;bootstrap.example.com&lt;/code&gt; as the Common Name (CN) so that it is valid for the bootstrap address, and Subject Alternative Names (SANs) are set for all broker DNS names (such as&amp;nbsp;&lt;code&gt;b-1.example.com&lt;/code&gt;). Because this solution uses a private certificate authority, the CA chain must be imported into the client trust stores.&lt;/p&gt; 
&lt;p&gt;This solution works with any server certificate, whether certificates are signed by a public or private Certificate Authority (CA). You can import existing certificates into ACM to be used with this solution. Certificates must provide a common name and/or subject alternative names that match the bootstrap DNS address as well as the individual broker DNS addresses. If the certificate is issued by a private CA, clients need to import the root and intermediate CA certificates to the client trust store. If the certificate is issued by a public CA, the root and intermediate CA certificates will be in the default trust store.&lt;/p&gt; 
&lt;h3&gt;Network Load Balancer&lt;/h3&gt; 
&lt;p&gt;The NLB provides the ability to use a &lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/network/create-tls-listener.html" target="_blank" rel="noopener noreferrer"&gt;TLS listener&lt;/a&gt;. The ACM certificate is associated with the listeners and enables TLS negotiation between the client and the NLB. The NLB performs a separate TLS negotiation between itself and the MSK brokers. In addition to the above architecture, this solution also allows using AWS PrivateLink to connect the cluster to external VPCs. This allows secure access to MSK between VPCs while using a custom domain name.&lt;/p&gt; 
&lt;p&gt;The following diagram illustrates the NLB port and target configuration. A TLS listener on port 9000 is used for bootstrap connections, with all MSK brokers set as targets. IAM authentication is configured to run on port 9098 of the MSK brokers using a TLS target type. Each broker in the MSK cluster is also represented by its own TLS listener port. In this post, the MSK cluster has three brokers, so ports 9001 through 9003 represent brokers 1 through 3, respectively.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="size-full wp-image-89067 aligncenter" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/03/18/image-2-2.jpg" alt="Target Group mapping in NLB" width="1001" height="748"&gt;&lt;/p&gt; 
&lt;h3&gt;Domain Name System (DNS)&lt;/h3&gt; 
&lt;p&gt;For the client to resolve DNS queries for the custom domain, we use an &lt;a href="https://aws.amazon.com/route53/" target="_blank" rel="noopener noreferrer"&gt;Amazon Route 53&lt;/a&gt; &lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/hosted-zones-private.html" target="_blank" rel="noopener noreferrer"&gt;private hosted zone&lt;/a&gt; to host the DNS records, and associate it with the client’s VPC to enable DNS resolution from the &lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/resolver.html" target="_blank" rel="noopener noreferrer"&gt;Route 53 VPC resolver&lt;/a&gt;. This solution uses a private MSK cluster and private DNS. For publicly accessible MSK clusters, a public NLB and a DNS provider such as a &lt;a href="https://docs.aws.amazon.com/Route53/latest/DeveloperGuide/AboutHZWorkingWith.html"&gt;Route 53 public hosted zone&lt;/a&gt; can be used.&lt;/p&gt; 
&lt;h3&gt;Amazon MSK&lt;/h3&gt; 
&lt;p&gt;Finally, each broker needs to have its advertised listeners configuration (&lt;code&gt;advertised.listeners&lt;/code&gt;) updated to match the custom domain name and NLB ports.&amp;nbsp;Advertised listeners is a configuration option used by Kafka clients to connect to the brokers. By default, an advertised listener is not set. Once set, Kafka clients use the advertised listener instead of&amp;nbsp;&lt;code&gt;listeners&lt;/code&gt; to obtain the connection information for brokers.&amp;nbsp;MSK brokers use the listener configuration to tell clients the DNS names and ports to use to connect to the individual brokers for each authentication type enabled. Advertised listeners are unique to each broker, and the cluster won’t start if multiple brokers have the same advertised listener address. For this reason, this solution uses a unique custom DNS name for each broker&amp;nbsp;(such as&amp;nbsp;&lt;code&gt;b-1.example.com&lt;/code&gt;).&lt;/p&gt; 
&lt;h2&gt;Solution deployment&lt;/h2&gt; 
&lt;p&gt;To deploy the solution, use the CloudFormation template from the &lt;a href="https://github.com/aws-samples/sample-msk-custom-domain-name-iam-auth" target="_blank" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; repository.&lt;/p&gt; 
&lt;p&gt;This template deploys a VPC, NLB, PCA, ACM certificate, MSK cluster, and an Amazon EC2 instance for cluster connectivity. The EC2 instance includes a script to handle updating the broker &lt;code&gt;advertised.listeners&lt;/code&gt; settings to match the custom domain name. For more information on deploying a CloudFormation template, refer to&amp;nbsp;&lt;a href="https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/cfn-console-create-stack.html"&gt;Create a stack from the CloudFormation console&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;After deploying the CloudFormation template, run the script to update advertised listeners as follows:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Retrieve the &lt;strong&gt;MSKClusterARN&lt;/strong&gt; and &lt;strong&gt;CertificateAuthorityARN&lt;/strong&gt; from the CloudFormation outputs for your stack as they will be used in subsequent steps.&lt;br&gt; &lt;img loading="lazy" class="size-full wp-image-89066 aligncenter" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/03/18/image-3-4.png" alt="" width="2324" height="1052"&gt;&lt;/li&gt; 
 &lt;li&gt;Navigate to the EC2 console and identify the KafkaClientInstance. Choose &lt;strong&gt;Connect&lt;/strong&gt; to connect to the instance using &lt;a href="https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html" target="_blank" rel="noopener noreferrer"&gt;AWS Systems Manager Session Manager&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;Session Manager starts a shell session. Start a bash login shell with the following command: 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="lang-shell"&gt;bash -l&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-89913" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/07/msk-iam-domain-image-4.jpg" alt="" width="656" height="149"&gt;&lt;/p&gt;&lt;/li&gt; 
 &lt;li&gt;The Kafka client SDKs are already installed on the EC2 instance. Update the &lt;code&gt;advertised.listeners&lt;/code&gt; configuration as follows, replacing &lt;strong&gt;CLUSTER_ARN&lt;/strong&gt; with the ARN of your MSK cluster retrieved from CloudFormation in step 1: 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="lang-shell"&gt;./update_advertised_listeners.sh --region us-east-1 --cluster-arn CLUSTER_ARN&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;p&gt;Note that once this script completes, the brokers will have new advertised listeners configurations. Connections using the standard IAM address for the MSK service will not work until we complete the next steps, as the brokers will redirect connections over this address back to the custom domain name and TLS will fail.&lt;/p&gt;&lt;/li&gt; 
 &lt;li&gt;Next, we need to create a truststore with the certificate for our AWS Private Certificate Authority (PCA) to allow TLS with the NLB. In the following command, replace &lt;strong&gt;PCA_ARN&lt;/strong&gt; with the ARN of the PCA retrieved from CloudFormation in step 1 and &lt;strong&gt;REGION&lt;/strong&gt; with your AWS Region. We’re using a copy of the default Java truststore, which uses the password &lt;code&gt;changeit&lt;/code&gt;. When asked “Trust this certificate?”, enter “yes”.&lt;br&gt; &lt;img loading="lazy" class="size-full wp-image-89064 aligncenter" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/03/18/image-5-3.png" alt="" width="2324" height="836"&gt; 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="lang-shell"&gt;export PCA_ARN=&amp;lt;&amp;lt;PCA_ARN&amp;gt;&amp;gt;
export REGION=&amp;lt;&amp;lt;REGION&amp;gt;&amp;gt;

# Work on a private, writable copy of the default Java truststore
cp /etc/pki/java/cacerts . &amp;amp;&amp;amp; chmod 600 cacerts
# Download the PCA certificate and import it into the copied truststore
aws acm-pca get-certificate-authority-certificate --certificate-authority-arn "$PCA_ARN" --region "$REGION" | jq -r '.Certificate' &amp;gt; pca.pem
keytool -import -file pca.pem -alias AWSPCA -keystore cacerts&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;/li&gt; 
 &lt;li&gt;Create a new properties file to allow IAM authentication with our custom truststore: 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="lang-shell"&gt;cat &amp;lt;&amp;lt;EOF &amp;gt;&amp;gt; /home/ssm-user/client-iam.properties
ssl.truststore.location=/home/ssm-user/cacerts
ssl.truststore.password=changeit
EOF&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;/li&gt; 
 &lt;li&gt;Verify that you can connect to the cluster with IAM authentication through the new custom domain name, replacing &lt;code&gt;bootstrap.example.com&lt;/code&gt; with your own custom domain name if you used a different one in CloudFormation: 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="lang-code"&gt;bin/kafka-topics.sh --list --command-config /home/ssm-user/client-iam.properties --bootstrap-server bootstrap.example.com:9000&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-89544 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/03/26/bdb5167i6.jpg" alt="" width="2560" height="360"&gt;&lt;/p&gt;&lt;/li&gt; 
&lt;/ol&gt; 
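&lt;p&gt;Step 6 appends only the truststore settings. If your &lt;code&gt;client-iam.properties&lt;/code&gt; file does not already contain the IAM authentication settings, the complete file might look like the following sketch. The four SASL properties are the standard client settings documented for the &lt;code&gt;aws-msk-iam-auth&lt;/code&gt; library. Whether the CloudFormation template already provides them is an assumption to check on your instance.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-shell"&gt;# Sketch: write a complete client properties file (overwrites any existing file)
cat &amp;lt;&amp;lt;EOF &amp;gt; /home/ssm-user/client-iam.properties
security.protocol=SASL_SSL
sasl.mechanism=AWS_MSK_IAM
sasl.jaas.config=software.amazon.msk.auth.iam.IAMLoginModule required;
sasl.client.callback.handler.class=software.amazon.msk.auth.iam.IAMClientCallbackHandler
ssl.truststore.location=/home/ssm-user/cacerts
ssl.truststore.password=changeit
EOF&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 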
&lt;h2&gt;Cleanup&lt;/h2&gt; 
&lt;p&gt;To stop incurring costs, open the CloudFormation console and delete the stack. This removes all resources that the stack provisioned.&lt;/p&gt; 
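&lt;p&gt;If you prefer the AWS CLI, the same cleanup can be sketched as follows. &lt;strong&gt;STACK_NAME&lt;/strong&gt; is a placeholder for the name you gave the stack, and the Region is assumed to be the one used in the deployment steps.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-shell"&gt;aws cloudformation delete-stack --stack-name STACK_NAME --region us-east-1
# Optionally wait until the deletion has finished
aws cloudformation wait stack-delete-complete --stack-name STACK_NAME --region us-east-1&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 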
&lt;h2&gt;Frequently asked questions about custom domain names&lt;/h2&gt; 
&lt;p&gt;Customers have asked a few questions about implementing custom domain names with MSK. You can find answers to some of the most popular questions here.&lt;/p&gt; 
&lt;h3&gt;Are there any limitations for this solution on MSK?&lt;/h3&gt; 
&lt;p&gt;The &lt;code&gt;advertised.listeners&lt;/code&gt; setting was removed as a dynamic broker configuration in KRaft-based Kafka clusters. Therefore, this solution is only supported in ZooKeeper-based MSK clusters. Additionally, this solution is only applicable to MSK clusters that use SASL/SCRAM or IAM authentication.&lt;/p&gt; 
&lt;h3&gt;How does the custom domain name solution scale when we add new brokers?&lt;/h3&gt; 
&lt;p&gt;When using the NLB for broker connectivity (&lt;a href="https://aws.amazon.com/blogs/big-data/configure-a-custom-domain-name-for-your-amazon-msk-cluster/#:~:text=Option%202%3A%20All%20connections%20through%20an%20NLB" target="_blank" rel="noopener noreferrer"&gt;option 2 in the blog post Configure a custom domain name for your Amazon MSK cluster&lt;/a&gt;), you need to add an NLB listener for each broker you add.&lt;/p&gt; 
&lt;p&gt;For TLS, if you use Subject Alternative Names (SANs) to list individual broker DNS hostnames, you need to create a new certificate that includes the names of the additional brokers. One option is to create a certificate with SANs for more brokers than needed to allow for growth. If a wildcard certificate is used, you do not need to modify certificates when adding brokers.&lt;/p&gt; 
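&lt;p&gt;As a sketch of the SAN approach, the following AWS CLI command requests a private certificate that covers the bootstrap name and six broker names, even if fewer brokers exist today. The domain names, the broker count, and the &lt;code&gt;PCA_ARN&lt;/code&gt; and &lt;code&gt;REGION&lt;/code&gt; variables are illustrative assumptions.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-shell"&gt;# Request a certificate from the private CA with room for future brokers
aws acm request-certificate \
    --domain-name bootstrap.example.com \
    --subject-alternative-names b-1.example.com b-2.example.com b-3.example.com \
        b-4.example.com b-5.example.com b-6.example.com \
    --certificate-authority-arn "$PCA_ARN" \
    --region "$REGION"&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The command returns the ARN of the new certificate, which you would then attach to the NLB TLS listeners.&lt;/p&gt; 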
&lt;h3&gt;What changes are required when we remove brokers?&lt;/h3&gt; 
&lt;p&gt;Amazon MSK supports scale-in by removing brokers from the cluster. Brokers are removed from each Availability Zone (AZ), so a 6-broker Amazon MSK cluster deployed in 3 AZs can be reduced to a 3-broker cluster deployed in 3 AZs. When brokers are removed, you can remove the NLB listeners for the removed brokers along with their Route 53 DNS entries. However, you can also leave them as is, or just remove the target IPs from the target groups for those broker numbers. The NLB will mark the targets as unhealthy and stop directing traffic to them. If you later scale out the number of brokers, you can reuse the existing NLB listeners and Route 53 DNS entries and would only need to update the target IPs in the target groups for those broker numbers.&lt;/p&gt; 
&lt;h3&gt;Is any configuration change required if a broker fails?&lt;/h3&gt; 
&lt;p&gt;No. When a broker fails, Amazon MSK replaces the failed broker with a new broker instance keeping the configuration of the broker exactly the same. So, there would be no change in the advertised listener of the broker. Once the broker is healthy, the broker can accept new connections and read/write traffic.&lt;/p&gt; 
&lt;h3&gt;Can you use Amazon MSK Replicator between MSK clusters in multiple AWS Regions when using the custom domain name solution?&lt;/h3&gt; 
&lt;p&gt;Yes. You can use &lt;a href="https://aws.amazon.com/msk/features/msk-replicator/" target="_blank" rel="noopener noreferrer"&gt;Amazon MSK Replicator&lt;/a&gt; with the custom domain name solution in either an active-passive or an active-active setup. Follow the same process to set the custom domain name.&lt;/p&gt; 
&lt;p&gt;Then follow the post &lt;a href="https://aws.amazon.com/blogs/big-data/build-multi-region-resilient-apache-kafka-applications-with-identical-topic-names-using-amazon-msk-and-amazon-msk-replicator/" target="_blank" rel="noopener noreferrer"&gt;Build multi-Region resilient Apache Kafka applications with identical topic names using Amazon MSK and Amazon MSK Replicator&lt;/a&gt; to configure MSK Replicator.&lt;/p&gt; 
&lt;p&gt;The following diagram shows an active-active AWS multi-Region MSK setup using the custom domain name solution:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="size-full wp-image-89062 aligncenter" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/03/18/image-7-5.png" alt="" width="1430" height="823"&gt;&lt;/p&gt; 
&lt;h3&gt;Can I use a global bootstrap DNS name to connect to Amazon MSK clusters deployed across multiple AWS Regions when IAM authentication is enabled?&lt;/h3&gt; 
&lt;p&gt;No, it is not possible to use a global bootstrap reference to represent MSK clusters deployed in multiple AWS Regions unless the client is aware of the cluster’s Region when connecting. To use IAM authentication, the correct AWS Region must be included in the IAM authentication request for a given cluster, because the AWS Region is part of the SigV4 authentication protocol used by IAM. This scope prevents the IAM authorization from being used to talk to a resource in another AWS Region. You can provide the AWS Region in one of two ways: with Region-specific bootstrap URLs or by explicitly configuring the Region.&lt;/p&gt; 
&lt;p&gt;For example, if the bootstrap string is &lt;code&gt;bootstrap.us-east-1.example.com&lt;/code&gt;, the &lt;a href="https://github.com/aws/aws-msk-iam-auth" target="_blank" rel="noopener noreferrer"&gt;msk-iam-auth&lt;/a&gt; library extracts the AWS Region from the broker connection string and uses &lt;code&gt;us-east-1&lt;/code&gt; in its IAM requests. If the bootstrap string is simply &lt;code&gt;bootstrap.example.com&lt;/code&gt;, the client must explicitly configure &lt;code&gt;AWS_REGION=us-east-1&lt;/code&gt; to connect to the cluster if it is in us-east-1, or &lt;code&gt;AWS_REGION=us-west-2&lt;/code&gt; if it is in us-west-2.&lt;/p&gt; 
&lt;p&gt;Note that this is a limitation for IAM authentication, but not for SASL/SCRAM authentication. With SASL/SCRAM authentication, if the client’s credentials are applied to both clusters, the global endpoint can point to either cluster and the client will be able to connect. The AWS Region is not used in SASL/SCRAM authentication, so it does not restrict the authentication scope.&lt;/p&gt; 
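&lt;p&gt;For a bootstrap name that carries no Region, the explicit configuration might look like the following sketch. The properties file path and the bootstrap port are taken from the deployment steps earlier in this post and are assumptions for your environment.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-shell"&gt;# The bootstrap name does not contain a Region, so set it explicitly for SigV4 signing
export AWS_REGION=us-east-1
bin/kafka-topics.sh --list \
    --command-config /home/ssm-user/client-iam.properties \
    --bootstrap-server bootstrap.example.com:9000&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 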
&lt;h3&gt;How do I allow public access to a private MSK cluster using the custom domain name solution?&lt;/h3&gt; 
&lt;p&gt;To provide public access to an MSK cluster using the custom domain name solution, you need to do the following:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Create an internet-facing NLB and associate it with public subnets (subnets that have a route to the internet gateway attached to the VPC).&lt;/li&gt; 
 &lt;li&gt;Create ingress rules in both the NLB and MSK security groups permitting the required public addresses. Note: the MSK security group must allow port 9098, and the NLB security group must allow the ports you are using on the NLB listeners.&lt;/li&gt; 
 &lt;li&gt;Provide public DNS resolution for the Kafka clients by using a Route 53 public hosted zone or an alternative public DNS resolver.&lt;/li&gt; 
 &lt;li&gt;The client needs IAM credentials with permission to talk to the MSK brokers, provided through an &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles.html" target="_blank" rel="noopener noreferrer"&gt;IAM role&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/security-creds-programmatic-access.html" target="_blank" rel="noopener noreferrer"&gt;IAM access keys&lt;/a&gt;, &lt;a href="https://aws.amazon.com/iam/roles-anywhere/" target="_blank" rel="noopener noreferrer"&gt;IAM Roles Anywhere&lt;/a&gt;, or another mechanism that uses the AWS Security Token Service (AWS STS) to create and provide trusted users with temporary security credentials.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;img loading="lazy" class="size-full wp-image-89061 aligncenter" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/03/18/image-8.jpg" alt="" width="1478" height="1262"&gt;&lt;/p&gt; 
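&lt;p&gt;The ingress rules in the second step might be added with the AWS CLI as in the following sketch. The security group IDs, the client address range (&lt;code&gt;203.0.113.0/24&lt;/code&gt;, a documentation range), and the NLB listener port range 9000-9003 are placeholders. Which source addresses the MSK security group must allow depends on how your NLB presents client traffic, so treat the second rule as an assumption to validate.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-shell"&gt;# Allow the client address range to reach the NLB listener ports
aws ec2 authorize-security-group-ingress \
    --group-id NLB_SECURITY_GROUP_ID \
    --protocol tcp --port 9000-9003 \
    --cidr 203.0.113.0/24

# Allow the same range to reach the MSK IAM port
aws ec2 authorize-security-group-ingress \
    --group-id MSK_SECURITY_GROUP_ID \
    --protocol tcp --port 9098 \
    --cidr 203.0.113.0/24&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 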
&lt;h3&gt;In the first part of the blog, two patterns were highlighted. How do I decide which pattern to use, and why?&lt;/h3&gt; 
&lt;h3&gt;Option 1: Only bootstrap connection through NLB&lt;/h3&gt; 
&lt;p&gt;If the Kafka clients have direct access to the brokers, you can use a custom domain name for the bootstrap connection while the clients still connect to the MSK brokers using the broker DNS names. This is the simplest option, as it does not require custom TLS certificates or TLS listeners. Note that this option is not necessary when using MSK Express brokers, as MSK Express brokers already manage bootstrapping through a broker-agnostic connection string. For MSK Express, this option adds no value other than providing a custom domain name for appearance or simpler client configuration. For MSK Standard brokers, this can improve client connectivity by making connection strings broker agnostic.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="size-full wp-image-89060 aligncenter" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/03/18/image-9.jpg" alt="" width="625" height="107"&gt;&lt;/p&gt; 
&lt;h3&gt;Option 2:&amp;nbsp;All connections through NLB&lt;/h3&gt; 
&lt;p&gt;When Kafka clients don’t have direct access to the Amazon MSK brokers, routing all connections through the NLB can be preferred. This can occur when a client is deployed in a different VPC than the Amazon MSK VPC or the client is external, and when Amazon MSK Multi VPC Connectivity is not an option. In general, Amazon MSK Multi VPC Connectivity is preferred because it is a simpler pattern for most organizations to manage MSK connectivity across accounts and VPCs. When Multi VPC Connectivity is not an option, an NLB can be used to provide connectivity with Transit Gateway or PrivateLink, and the solution described in this post should be used.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="size-full wp-image-89059 aligncenter" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/03/18/image-10.jpg" alt="" width="623" height="168"&gt;&lt;/p&gt; 
&lt;p&gt;The following example architecture shows a Kafka client and an Amazon MSK cluster deployed in two separate VPCs and connected through AWS PrivateLink.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="size-full wp-image-89058 aligncenter" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/03/18/image-11-2.png" alt="" width="1287" height="611"&gt;&lt;/p&gt; 
&lt;h3&gt;Is Amazon Route 53 required to use a custom domain name with Amazon MSK?&lt;/h3&gt; 
&lt;p&gt;No. You can use an alternative DNS resolver service; Amazon Route 53 is not required to use a custom domain name with Amazon MSK. The only requirement is that your clients can resolve names against your DNS resolver service. The only change required is to use CNAME records that reference the &lt;a href="https://docs.aws.amazon.com/elasticloadbalancing/latest/network/network-load-balancers.html#dns-name" target="_blank" rel="noopener noreferrer"&gt;NLB’s DNS name&lt;/a&gt; in place of the alias records, because the alias record type is only available in Amazon Route 53.&lt;/p&gt; 
&lt;h3&gt;We don’t use AWS Certificate Manager (ACM). Can NLB integrate with third-party certificate managers?&lt;/h3&gt; 
&lt;p&gt;NLB only supports ACM to bind a certificate to a TLS listener. However, you can import a certificate created with your third-party certificate manager into ACM, so you do not need to create the certificate using ACM.&lt;/p&gt; 
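&lt;p&gt;As a sketch, importing an externally issued certificate with the AWS CLI might look like the following. The file names and the Region are placeholders for your own PEM-encoded certificate, private key, and certificate chain.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-shell"&gt;# Import a third-party certificate into ACM so it can be bound to an NLB TLS listener
aws acm import-certificate \
    --certificate fileb://certificate.pem \
    --private-key fileb://private-key.pem \
    --certificate-chain fileb://certificate-chain.pem \
    --region us-east-1&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The command returns the ARN of the imported certificate, which you can then select on the NLB TLS listener.&lt;/p&gt; 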
&lt;h3&gt;Getting “connection to node terminated during authentication” after setting &lt;code&gt;advertised.listeners&lt;/code&gt;, what could be the issue?&lt;/h3&gt; 
&lt;p&gt;As the issue started to occur after changing the&amp;nbsp;&lt;code&gt;advertised.listeners&lt;/code&gt;&amp;nbsp;configuration, the issue is unlikely to be related to permissions. The following can cause this issue:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;The NLB and/or client’s Security Group does not permit access to the listener ports on the NLB from the client.&lt;/li&gt; 
 &lt;li&gt;A firewall appliance between the NLB and client does not permit the client to talk to the NLB using the listener ports.&lt;/li&gt; 
 &lt;li&gt;The &lt;code&gt;advertised.listeners&lt;/code&gt; configuration has an error that causes the client to receive invalid details, such as a typo in the name. If this is the case, use a client in the same VPC as the MSK broker that has IAM permissions to talk to the MSK broker and security group rules permitting connectivity, and then use the following command to delete the &lt;code&gt;advertised.listeners&lt;/code&gt; configuration.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-shell"&gt;/home/ec2-user/kafka/bin/kafka-configs.sh --alter \
        --bootstrap-server BROKERS_AMAZON_DNS_NAME \
        --entity-type brokers \
        --entity-name BROKER_ID \
        --command-config ~/kafka/config/client_iam.properties \
        --delete-config advertised.listeners&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Replace BROKERS_AMAZON_DNS_NAME with the broker’s Amazon-provided DNS name and port, such as &lt;code&gt;b-1.clustername.xxxxxx.yy.kafka.region.amazonaws.com:9098&lt;/code&gt;, and BROKER_ID with the ID of the broker to change (for example, 1).&lt;/p&gt; 
&lt;h3&gt;Getting “unexpected broker id, expected 2 or empty string, but received 1”, what is causing this error?&lt;/h3&gt; 
&lt;p&gt;This error typically occurs when the &lt;code&gt;advertised.listeners&lt;/code&gt; configuration for one broker specifies a port that belongs to another broker. For example, if broker 2 has port 9001 set for IAM but that port is used to connect to broker 1, broker 1 responds with an error indicating that the client presented broker ID 2 while it is broker 1.&lt;/p&gt; 
&lt;p&gt;To correct this, you will need to update the broker with the incorrect&amp;nbsp;&lt;code&gt;advertised.listeners&lt;/code&gt;&amp;nbsp;configuration to use the correct port. To gain access to the broker to make the change, you will need to use the following command to delete the incorrect configuration:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-shell"&gt;/home/ec2-user/kafka/bin/kafka-configs.sh --alter \
        --bootstrap-server BROKERS_AMAZON_DNS_NAME \
        --entity-type brokers \
        --entity-name BROKER_ID \
        --command-config ~/kafka/config/client_iam.properties \
        --delete-config advertised.listeners&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Replace BROKERS_AMAZON_DNS_NAME with the broker’s Amazon-provided DNS name and port, such as &lt;code&gt;b-2.clustername.xxxxxx.yy.kafka.region.amazonaws.com:9098&lt;/code&gt;, and BROKER_ID with the ID of the broker to change (for example, 2).&lt;/p&gt; 
&lt;p&gt;You then need to use the following command to set the&amp;nbsp;&lt;code&gt;advertised.listeners&lt;/code&gt;&amp;nbsp;configuration for that broker:&lt;/p&gt; 
&lt;p&gt;Note: The &lt;code&gt;advertised.listeners&lt;/code&gt; configuration in the following command assumes that only IAM is used for authentication. If you are using additional authentication options, you need to include them.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-shell"&gt;# Placeholder values, inferred from how the variables are used below
MSKDOMAIN=MSK_CLUSTER_DOMAIN  # for example, clustername.xxxxxx.yy.kafka.region.amazonaws.com
broker_id=BROKER_ID           # for example, 2
Domain=CUSTOM_DOMAIN          # for example, example.com

# BROKERS_AMAZON_DNS_NAME is the broker's Amazon-provided DNS name and port
/home/ec2-user/kafka/bin/kafka-configs.sh --alter \
        --bootstrap-server BROKERS_AMAZON_DNS_NAME \
        --entity-type brokers \
        --entity-name "$broker_id" \
        --command-config ~/kafka/config/client_iam.properties \
        --add-config "advertised.listeners=[CLIENT_IAM://b-$broker_id.$Domain:900$broker_id,REPLICATION://b-$broker_id-internal.$MSKDOMAIN:9093,REPLICATION_SECURE://b-$broker_id-internal.$MSKDOMAIN:9095]"&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
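&lt;p&gt;To confirm the result, you can describe the broker’s dynamic configuration. The following sketch reuses the paths from the preceding commands. BROKERS_AMAZON_DNS_NAME and BROKER_ID are placeholders for the broker’s Amazon-provided DNS name (with port 9098) and its broker ID.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-shell"&gt;# Show the dynamic configuration currently set on one broker
/home/ec2-user/kafka/bin/kafka-configs.sh --describe \
        --bootstrap-server BROKERS_AMAZON_DNS_NAME \
        --entity-type brokers \
        --entity-name BROKER_ID \
        --command-config ~/kafka/config/client_iam.properties&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 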
&lt;h2&gt;Summary&lt;/h2&gt; 
&lt;p&gt;In this post, we explained how you can use an NLB, Route 53, and the advertised listener configuration option in Amazon MSK to support custom domain names with MSK clusters when using IAM authentication. You can use this solution to keep your existing Kafka bootstrap DNS name and reduce or remove the need to change client applications because of a migration, recovery process, or to use a DNS name in line with your organization’s naming convention (for example,&amp;nbsp;msk.prod.example.com).&lt;/p&gt; 
&lt;p&gt;Try the solution out for yourself, and leave your questions and feedback in the comments section.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2024/04/26/subham.jpg" alt="Subham Rakshit" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Subham Rakshit&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/subhamrakshit/" target="_blank" rel="noopener"&gt;Subham&lt;/a&gt; is a Senior Streaming Solutions Architect for Analytics at AWS based in the UK. He works with customers to design and build streaming architectures so they can get value from analyzing their streaming data. His two little daughters keep him occupied most of the time outside work, and he loves solving jigsaw puzzles with them.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2024/06/17/mgtaylor_headshot.jpg" alt="Mark Taylor" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Mark Taylor&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/mark-taylor-5b77a525/" target="_blank" rel="noopener"&gt;Mark&lt;/a&gt; is a Senior Technical Account Manager at AWS, working with enterprise customers to implement best practices, optimize AWS usage, and address business challenges. Mark lives in Folkestone, England, with his wife and two dogs. Outside of work, he enjoys watching and playing football, watching movies, playing board games, and traveling.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-89475" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/03/24/bdb-5775-mmehrten-headshot.png" alt="Mazrim Mehrtens" width="100" height="107"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Mazrim Mehrtens&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/mmehrtens/" target="_blank" rel="noopener"&gt;Mazrim&lt;/a&gt; is a Sr. Specialist Solutions Architect for messaging and streaming workloads. Mazrim works with customers to build and support systems that process and analyze terabytes of streaming data in real time, run enterprise Machine Learning pipelines, and create systems to share data across teams seamlessly with varying data toolsets and software stacks.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Migrate third-party and self-managed Apache Kafka clusters to Amazon MSK Express brokers with Amazon MSK Replicator</title>
		<link>https://aws.amazon.com/blogs/big-data/migrate-third-party-and-self-managed-apache-kafka-clusters-to-amazon-msk-express-brokers-with-amazon-msk-replicator/</link>
					
		
		<dc:creator><![CDATA[Ankita Mishra]]></dc:creator>
		<pubDate>Mon, 20 Apr 2026 20:00:27 +0000</pubDate>
				<category><![CDATA[Amazon Managed Streaming for Apache Kafka (Amazon MSK)]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Announcements]]></category>
		<category><![CDATA[Migration]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">8cb88c43081d8e03912134cd163fcbe713a04c24</guid>

					<description>In this post, we walk you through how to replicate Apache Kafka data from your external Apache Kafka deployments to Amazon MSK Express brokers using MSK Replicator. You will learn how to configure authentication on your external cluster, establish network connectivity, set up bidirectional replication, and monitor replication health to achieve a low-downtime migration.</description>
										<content:encoded>&lt;p&gt;Migrating Apache Kafka workloads to the cloud often involves managing complex replication infrastructure, coordinating application cutovers with extended downtime windows, and maintaining deep expertise in open-source tools like Apache Kafka’s MirrorMaker 2 (MM2). These challenges slow down migrations and increase operational risk. &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/msk-replicator.html" target="_blank" rel="noopener noreferrer"&gt;Amazon MSK Replicator&lt;/a&gt; addresses these challenges, enabling you to migrate your Kafka deployments (referred to as “external” Kafka clusters) to &lt;a href="https://aws.amazon.com/msk/" target="_blank" rel="noopener noreferrer"&gt;Amazon MSK&lt;/a&gt; &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/msk-broker-types-express.html" target="_blank" rel="noopener noreferrer"&gt;Express brokers&lt;/a&gt; with minimal operational overhead and reduced downtime. MSK Replicator supports data migration from Kafka deployments (version 2.8.1 or later) that have &lt;a href="https://kafka.apache.org/42/security/authentication-using-sasl/" target="_blank" rel="noopener noreferrer"&gt;SASL/SCRAM authentication&lt;/a&gt; enabled, including Kafka clusters running on premises, on AWS, or on other cloud providers, as well as Kafka-protocol-compatible services like Confluent Platform, Aiven, Redpanda, WarpStream, or AutoMQ when configured with SASL/SCRAM authentication.&lt;/p&gt; 
&lt;p&gt;In this post, we walk you through how to replicate Apache Kafka data from your external Apache Kafka deployments to Amazon MSK Express brokers using MSK Replicator. You will learn how to configure authentication on your external cluster, establish network connectivity, set up bidirectional replication, and monitor replication health to achieve a low-downtime migration.&lt;/p&gt; 
&lt;h2&gt;How it works&lt;/h2&gt; 
&lt;p&gt;MSK Replicator is a fully managed serverless service that replicates topics, configurations, and offsets from cluster to cluster. It removes the need to manage complex infrastructure or configure open-source tools.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90053" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/10/BDB-5876-1.png" alt="" width="3084" height="3044"&gt;&lt;/p&gt; 
&lt;p&gt;Before MSK Replicator, customers used tools like MM2 for migrations. These tools lack bidirectional topic replication when using the same topic names, which forces complex application architectures that consume different topics on different clusters. Custom replication policies in MM2 can allow identical topic names, but MM2 still lacks bidirectional offset replication because the MM2 architecture requires producers and consumers to run on the same cluster to replicate offsets. This leads to complex migrations that require either migrating consumers before producers or big-bang migrations that move all applications at once. When customers run into issues during the migration, the rollback process is error-prone and introduces large amounts of duplicate message processing due to the lack of consumer group offset synchronization. These approaches create risk and complexity that make migrations difficult to manage.&lt;/p&gt; 
&lt;p&gt;MSK Replicator addresses these problems by supporting bidirectional replication of data and enhanced consumer group offset synchronization. MSK Replicator copies topics and offsets from an external Kafka cluster to MSK, allowing you to preserve the same topic and consumer group names on both clusters. MSK Replicator also supports creating a second Replicator instance for bidirectional replication of both data and enhanced offset synchronization, allowing producers and consumers to run independently on different Kafka clusters. Data published or consumed on the Amazon MSK cluster will be replicated back to the external cluster by the second Replicator. As a result, you can migrate producers and consumers in any order without worrying about dependencies between applications.&lt;/p&gt; 
&lt;p&gt;Because MSK Replicator provides bidirectional data replication and enhanced consumer group offset synchronization, you can move producers and consumers at your own pace without data loss. This reduces migration complexity, allowing you to migrate applications between your external Kafka cluster and Amazon MSK regardless of order. If you run into problems during the migration, enhanced offset synchronization allows you to roll back changes by moving applications back to the external Kafka cluster, where they restart from the latest checkpoint from the Amazon MSK cluster.&lt;/p&gt; 
&lt;p&gt;For example, consider three applications:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;The “Orders” application, which accepts incoming orders and writes them to the &lt;code&gt;orders&lt;/code&gt; Kafka topic&lt;/li&gt; 
 &lt;li&gt;The “Order status” application, which reads from the &lt;code&gt;orders&lt;/code&gt; Kafka topic and writes status updates to the &lt;code&gt;order_status&lt;/code&gt; topic&lt;/li&gt; 
 &lt;li&gt;The “Customer notification” application, which reads from the &lt;code&gt;order_status&lt;/code&gt; topic and notifies customers when status changes&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90054" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/10/BDB-5876-2.png" alt="" width="3764" height="1364"&gt;&lt;/p&gt; 
&lt;p&gt;MSK Replicator enables these applications to be migrated between an on-premises Apache Kafka cluster and an Amazon MSK Express cluster with low downtime and no data loss, regardless of order. The “Order status” application can migrate first, receive orders from the on-premises “Orders” application, and send status updates to the on-premises “Customer notification” application. If issues arise during the migration, the “Order status” application can roll back to the on-premises cluster and its consumer group offsets for the orders topic will be ready for it to pick up from where it left off on the Amazon MSK cluster.&lt;/p&gt; 
&lt;p&gt;MSK Replicator supports data distribution across hybrid and multi-cloud environments for analytics, compliance, and business continuity. It can also be configured for disaster recovery scenarios where Amazon MSK Express serves as a resilient target for your external Kafka clusters.&lt;/p&gt; 
&lt;p&gt;If you are currently using MM2 for replication, see &lt;a href="https://aws.amazon.com/blogs/big-data/amazon-msk-replicator-and-mirrormaker2-choosing-the-right-replication-strategy-for-apache-kafka-disaster-recovery-and-migrations/" target="_blank" rel="noopener noreferrer"&gt;Amazon MSK Replicator and MirrorMaker2: Choosing the right replication strategy for Apache Kafka disaster recovery and migrations&lt;/a&gt; to understand which solution best fits your use case.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;MSK Replicator supports Kafka deployments running version 2.8.1 or later as a source, including third-party managed Kafka services, self-managed Kafka, and on-premises or third-party cloud-hosted Kafka. MSK Replicator automatically handles data transfer, uses SASL/SCRAM authentication with SSL encryption, and maintains consumer group positions across both clusters. If you do not use SASL/SCRAM today, you can add it as a new listener dedicated to MSK Replicator, so that current clients keep using their existing authentication mechanisms.&lt;/p&gt; 
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;To follow along with this walkthrough, you need the following resources in place:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;A source Kafka cluster using &lt;a href="https://kafka.apache.org/community/downloads/#281" target="_blank" rel="noopener noreferrer"&gt;Kafka version 2.8.1&lt;/a&gt; or above&lt;/li&gt; 
 &lt;li&gt;Network connectivity between your external Kafka cluster and AWS (for example, using &lt;a href="https://aws.amazon.com/directconnect/" target="_blank" rel="noopener noreferrer"&gt;AWS Direct Connect&lt;/a&gt;, &lt;a href="https://aws.amazon.com/vpn/" target="_blank" rel="noopener noreferrer"&gt;Site-to-Site VPN&lt;/a&gt;, or &lt;a href="https://aws.amazon.com/vpc/" target="_blank" rel="noopener noreferrer"&gt;Amazon Virtual Private Cloud&lt;/a&gt; (VPC) &lt;a href="https://docs.aws.amazon.com/vpc/latest/peering/what-is-vpc-peering.html"&gt;peering&lt;/a&gt; or &lt;a href="https://aws.amazon.com/transit-gateway/"&gt;AWS Transit Gateway&lt;/a&gt; for connections between AWS VPCs) so that MSK Replicator can reach your source brokers&lt;/li&gt; 
 &lt;li&gt;SASL/SCRAM authentication configured on your external cluster (SHA-256 or SHA-512), which MSK Replicator uses to authenticate with external clusters&lt;/li&gt; 
 &lt;li&gt;An admin user configured on your external cluster with permissions to describe the external cluster and create and modify users/ACLs&lt;/li&gt; 
 &lt;li&gt;An Amazon MSK Express cluster with &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/iam-access-control.html" target="_blank" rel="noopener noreferrer"&gt;IAM authentication enabled&lt;/a&gt; to serve as your target&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/secrets-manager/" target="_blank" rel="noopener noreferrer"&gt;AWS Secrets Manager&lt;/a&gt; configured to store your SASL/SCRAM credentials for the external cluster so that MSK Replicator can securely retrieve them at runtime&lt;/li&gt; 
 &lt;li&gt;An &lt;a href="https://aws.amazon.com/cloudwatch/" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch&lt;/a&gt; log group for MSK Replicator logs&lt;/li&gt; 
 &lt;li&gt;Appropriate &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/msk-replicator-create-iam-perms.html" target="_blank" rel="noopener noreferrer"&gt;IAM permissions for creating and managing MSK Replicator&lt;/a&gt; resources&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Setting up replication&lt;/h2&gt; 
&lt;h3&gt;Step 1: Configure network connectivity&lt;/h3&gt; 
&lt;p&gt;You can set up network connectivity between your external Kafka cluster and your AWS VPC using methods such as AWS Direct Connect for dedicated network connections, AWS Site-to-Site VPN for encrypted connections over the internet, and AWS VPC peering or AWS Transit Gateway for connections between AWS VPCs. Verify that IP routing and DNS resolution are properly configured between your external cluster and AWS.&lt;/p&gt; 
&lt;p&gt;To verify IP routing and DNS resolution, connect to your external Kafka cluster from inside your VPC and use the Kafka CLI to list topics on the external cluster. If the topic list comes back, DNS resolution and IP routing are working. If the command fails, work with your network admins to troubleshoot the connectivity issue.&lt;/p&gt; 
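&lt;p&gt;Before running the Kafka CLI check, a plain TCP probe can help separate DNS failures from routing or firewall failures. The following Python sketch is illustrative only: &lt;code&gt;check_broker&lt;/code&gt; is a hypothetical helper name, not part of any AWS or Kafka tooling, and it confirms only name resolution and TCP reachability, not SASL/SCRAM authentication or TLS.&lt;/p&gt;

```python
import socket


def check_broker(host, port, timeout=3.0):
    """Return (resolved_addresses, reachable) for one bootstrap broker.

    Hypothetical helper: checks DNS resolution and TCP reachability only.
    """
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror:
        # The name did not resolve from this host.
        return [], False
    addresses = sorted({info[4][0] for info in infos})
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return addresses, True
    except OSError:
        # The name resolved, but the port is closed or unreachable.
        return addresses, False
```

&lt;p&gt;Run it from a host inside your VPC, for example &lt;code&gt;check_broker("your-broker-host", 9096)&lt;/code&gt;. An empty address list points to DNS, while a resolved address with &lt;code&gt;False&lt;/code&gt; points to routing, security groups, or firewalls.&lt;/p&gt;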
&lt;h3&gt;Step 2: Configure external cluster&lt;/h3&gt; 
&lt;p&gt;In this step, you will set up authentication on your external Kafka cluster and store the credentials in AWS Secrets Manager so that MSK Replicator can connect securely.&lt;/p&gt; 
&lt;h4&gt;Configure authentication&lt;/h4&gt; 
&lt;p&gt;Using the external cluster admin user, configure SASL/SCRAM authentication for MSK Replicator using SHA-256 or SHA-512 on your external Kafka cluster. Create a SASL/SCRAM user for MSK Replicator and give the user the following ACL permissions:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Topic operations –&lt;/strong&gt; Alter, AlterConfigs, Create, Describe, DescribeConfigs, Read, Write&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Group operations –&lt;/strong&gt; Read, Describe&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Cluster operations –&lt;/strong&gt; Create, ClusterAction, Describe, DescribeConfigs&lt;/li&gt; 
&lt;/ul&gt; 
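&lt;p&gt;The permissions above map onto the &lt;code&gt;kafka-acls.sh&lt;/code&gt; tool that ships with Apache Kafka. The following Python sketch only generates candidate command lines for review; it does not run them. The principal name &lt;code&gt;msk-replicator&lt;/code&gt;, the &lt;code&gt;admin.properties&lt;/code&gt; file, and the wildcard resource scope are assumptions for illustration, so narrow the scope to match your own security policy.&lt;/p&gt;

```python
import shlex

# Operations copied from the ACL list above.
ACL_OPERATIONS = {
    "topic": ["Alter", "AlterConfigs", "Create", "Describe",
              "DescribeConfigs", "Read", "Write"],
    "group": ["Read", "Describe"],
    "cluster": ["Create", "ClusterAction", "Describe", "DescribeConfigs"],
}


def acl_commands(principal, bootstrap, command_config="admin.properties"):
    """Build one kafka-acls.sh command line per resource type."""
    commands = []
    for resource, operations in ACL_OPERATIONS.items():
        argv = [
            "bin/kafka-acls.sh",
            "--bootstrap-server", bootstrap,
            "--command-config", command_config,
            "--add",
            "--allow-principal", "User:" + principal,
        ]
        for operation in operations:
            argv += ["--operation", operation]
        if resource == "cluster":
            argv.append("--cluster")
        else:
            # Assumption: wildcard scope covering every topic or group.
            argv += ["--" + resource, "*"]
        commands.append(shlex.join(argv))
    return commands


if __name__ == "__main__":
    for line in acl_commands("msk-replicator", "your-broker-host:9096"):
        print(line)
```

&lt;p&gt;Printing the commands first lets the cluster admin review them before anything changes on the external cluster.&lt;/p&gt;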
&lt;h4&gt;Configure Secrets Manager&lt;/h4&gt; 
&lt;p&gt;AWS Secrets Manager stores your SASL/SCRAM credentials securely so that MSK Replicator can retrieve them at runtime. The secret must use JSON format and have the following keys:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;code&gt;&lt;strong&gt;username&lt;/strong&gt;&lt;/code&gt; – The SCRAM username that you configured in the authentication step above&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;&lt;strong&gt;password&lt;/strong&gt;&lt;/code&gt; – The SCRAM password that you configured in the authentication step above&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;&lt;strong&gt;certificate&lt;/strong&gt;&lt;/code&gt; – The public root CA certificate (the top-level certificate authority that issued your cluster’s TLS certificate) and the intermediate CA chain (intermediate certificates between the root and your cluster’s certificate), used for SSL handshakes with the external cluster&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Optionally, you may create separate secrets for SCRAM credentials and the SSL certificate. This approach is useful when secrets for SCRAM credentials and certificates are provisioned in different stages, such as in Infrastructure as Code (IaC) pipelines.&lt;/p&gt; 
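&lt;p&gt;As a rough illustration of the single-secret layout, the following Python sketch assembles the JSON document with the three keys listed above. The &lt;code&gt;build_secret_string&lt;/code&gt; helper is hypothetical, and the check for a PEM header assumes that the CA chain is supplied as PEM text.&lt;/p&gt;

```python
import json


def build_secret_string(username, password, ca_chain_pem):
    """Return the JSON document to store as the secret value.

    Assumption: the root and intermediate CA certificates are PEM text.
    """
    if "-----BEGIN CERTIFICATE-----" not in ca_chain_pem:
        raise ValueError("expected a PEM-encoded CA chain")
    secret = {
        "username": username,
        "password": password,
        "certificate": ca_chain_pem,
    }
    return json.dumps(secret)
```

&lt;p&gt;You could write the returned string to a file and pass that file to &lt;code&gt;aws secretsmanager create-secret&lt;/code&gt; through its &lt;code&gt;--secret-string&lt;/code&gt; option. Avoid putting the password directly on the command line, where it can end up in shell history.&lt;/p&gt;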
&lt;h4&gt;Retrieve the cluster ID&lt;/h4&gt; 
&lt;p&gt;As the admin user, use the &lt;a href="https://downloads.apache.org/kafka/" target="_blank" rel="noopener noreferrer"&gt;Kafka CLI tools&lt;/a&gt; to retrieve the cluster ID of your external cluster. Run the following command, replacing &lt;code&gt;your-broker-host:9096&lt;/code&gt; with the address of one of your external cluster’s bootstrap servers:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-code"&gt;bin/kafka-cluster.sh cluster-id --bootstrap-server your-broker-host:9096 --config admin.properties&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;The command returns a cluster ID string such as &lt;code&gt;lkc-abc123&lt;/code&gt;. Take note of this value because you will need it when creating the replicator in Step 4.&lt;/p&gt; 
&lt;h3&gt;Step 3: Create your MSK Express target cluster&lt;/h3&gt; 
&lt;p&gt;With your external cluster configured, you can now set up the target. Create an Amazon MSK Express cluster with IAM authentication enabled. Make sure that the cluster is in subnets that have access to &lt;a href="https://aws.amazon.com/secrets-manager/" target="_blank" rel="noopener noreferrer"&gt;AWS Secrets Manager&lt;/a&gt; endpoints. See &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/getting-started.html" target="_blank" rel="noopener noreferrer"&gt;Get started using Amazon MSK&lt;/a&gt; for more information on creating an MSK cluster.&lt;/p&gt; 
&lt;h3&gt;Step 4: Create the replicator&lt;/h3&gt; 
&lt;p&gt;Now that both clusters are ready, you can connect them by setting up the MSK Replicator with the appropriate IAM role and replication configuration.&lt;/p&gt; 
&lt;h4&gt;Set up an IAM role for MSK Replicator&lt;/h4&gt; 
&lt;p&gt;MSK Replicator needs an IAM role to interact with your MSK Express cluster and retrieve secrets. Set up a service execution IAM role with a trust policy allowing &lt;code&gt;kafka.amazonaws.com&lt;/code&gt; and attach the &lt;code&gt;AWSMSKReplicatorExecutionRole&lt;/code&gt; permissions policy. Take note of the role ARN for creating the replicator.&lt;/p&gt; 
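&lt;p&gt;For reference, a trust policy that lets the &lt;code&gt;kafka.amazonaws.com&lt;/code&gt; service principal assume the role could look like the output of the following Python sketch. The function name is hypothetical. Treat this as a minimal starting point, not a complete policy: your organization may require additional condition keys.&lt;/p&gt;

```python
import json


def replicator_trust_policy():
    """Minimal trust policy allowing kafka.amazonaws.com to assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "kafka.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


if __name__ == "__main__":
    print(json.dumps(replicator_trust_policy(), indent=2))
```

&lt;p&gt;Saving the printed JSON to a file lets you supply it to &lt;code&gt;aws iam create-role&lt;/code&gt; through the &lt;code&gt;--assume-role-policy-document&lt;/code&gt; option.&lt;/p&gt;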
&lt;p&gt;Create and attach a policy for accessing your Secrets Manager secrets and reading/writing data in your MSK cluster. See &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/access_policies_job-functions_create-policies.html" target="_blank" rel="noopener noreferrer"&gt;Creating roles and attaching policies (console)&lt;/a&gt; for more information on creating IAM roles and policies.&lt;/p&gt; 
&lt;p&gt;The following is an example policy for reading and writing data to your MSK cluster and reading KMS-encrypted Secrets Manager secrets:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-json"&gt;{&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp; "Version": "2012-10-17",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp; "Statement": [&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "Sid": "SecretsManagerAccess",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "Effect": "Allow",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "Action": [&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "secretsmanager:GetSecretValue",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "secretsmanager:DescribeSecret"&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ],&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "Resource": [&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "&amp;lt;SCRAM_SECRET_ARN&amp;gt;",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "&amp;lt;CERT_SECRET_ARN&amp;gt;"&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ]&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; },&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "Sid": "KMSDecrypt",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "Effect": "Allow",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "Action": "kms:Decrypt",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "Resource": "&amp;lt;SECRETSMANAGER_KMS_KEY_ARN&amp;gt;"&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; },&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "Sid": "TargetClusterAccess",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "Effect": "Allow",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "Action": [&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "kafka-cluster:Connect",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "kafka-cluster:DescribeCluster",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "kafka-cluster:AlterCluster",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "kafka-cluster:DescribeClusterDynamicConfiguration",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "kafka-cluster:AlterClusterDynamicConfiguration",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "kafka-cluster:DescribeTopic",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "kafka-cluster:CreateTopic",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "kafka-cluster:AlterTopic",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "kafka-cluster:DescribeTopicDynamicConfiguration",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "kafka-cluster:AlterTopicDynamicConfiguration",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "kafka-cluster:WriteData",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "kafka-cluster:WriteDataIdempotently",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "kafka-cluster:ReadData",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "kafka-cluster:DescribeGroup",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "kafka-cluster:AlterGroup"&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ],&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "Resource": [&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "arn:aws:kafka:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT_ID&amp;gt;:cluster/&amp;lt;MSK_CLUSTER_NAME&amp;gt;*/*",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "arn:aws:kafka:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT_ID&amp;gt;:topic/&amp;lt;MSK_CLUSTER_NAME&amp;gt;/*",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "arn:aws:kafka:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT_ID&amp;gt;:group/&amp;lt;MSK_CLUSTER_NAME&amp;gt;*/*"&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ]&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; },&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; {&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "Sid": "CloudWatchLogsAccess",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "Effect": "Allow",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "Action": [&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "logs:CreateLogStream",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "logs:PutLogEvents",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "logs:DescribeLogStreams"&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; ],&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "Resource": "&amp;lt;MSK_REPLICATOR_LOG_GROUP_ARN&amp;gt;"&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp; ]&amp;nbsp;
}
&lt;/code&gt;&lt;/pre&gt; 
&lt;h4&gt;Create the replicator for external to MSK replication&lt;/h4&gt; 
&lt;p&gt;Use the AWS CLI, API, or Console to create your replicator. Here’s an example using the AWS CLI:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;aws kafka create-replicator \
&amp;nbsp; --replicator-name external-to-msk \
&amp;nbsp; --service-execution-role-arn "arn:aws:iam::123456789012:role/MSKReplicatorRole" \
&amp;nbsp; --kafka-clusters file://./kafka-clusters.json \
&amp;nbsp; --replication-info-list file://./replication-info.json \
&amp;nbsp; --log-delivery file://./log-delivery.json \
&amp;nbsp; --region us-east-1&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;The &lt;code&gt;kafka-clusters.json&lt;/code&gt; file defines the source and target Kafka cluster connection information, &lt;code&gt;replication-info.json&lt;/code&gt; specifies which topics to replicate and how to handle consumer group offset synchronization, and &lt;code&gt;log-delivery.json&lt;/code&gt; specifies the CloudWatch logging configuration. The following tables describe the required parameters:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;CLI inputs:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;CLI Parameter&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Description&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Example&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;replicator-name&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;The name of the replicator&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;external-to-msk&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;service-execution-role-arn&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;The ARN for the service execution IAM role you created&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;arn:aws:iam::123456789012:role/MSKReplicatorRole&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;kafka-clusters&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;The Kafka cluster connection info&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;See below&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;replication-info-list&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;The replication configuration&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;See below&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;log-delivery&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;The logging configuration&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;See below&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Key &lt;code&gt;kafka-clusters.json&lt;/code&gt; inputs:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Field&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Description&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Example&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;ApacheKafkaClusterId&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;The cluster ID retrieved in Step 2&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;lkc-abc123&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;RootCaCertificate&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;The Secrets Manager ARN containing the public CA certificate and intermediate CA chain&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;arn:aws:secretsmanager:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT_ID&amp;gt;:secret:my-cert&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;MskClusterArn&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;The ARN for the MSK Express cluster&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;arn:aws:kafka:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT_ID&amp;gt;:cluster/my-cluster/abc-123&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;SecretArn&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;The Secrets Manager ARN containing the SASL/SCRAM username and password&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;arn:aws:secretsmanager:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT_ID&amp;gt;:secret:my-creds&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;SecurityGroupIds&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;The security group IDs for MSK Replicator&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;sg-0123456789abcdef0&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Key &lt;code&gt;replication-info.json&lt;/code&gt; inputs:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Field&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Description&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Example&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;TargetCompressionType&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;The compression type to use for replicating data&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;LZ4&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;TopicsToReplicate&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;The list of topics to replicate (use [".*"] for all topics)&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;["my-topic"]&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;ConsumerGroupsToReplicate&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;The list of consumer groups to replicate&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;["my-group"]&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;StartingPosition&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;The point in the Kafka topics to begin replication from (either EARLIEST or LATEST)&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;EARLIEST&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;ConsumerGroupOffsetSyncMode&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Whether or not to use enhanced bidirectional consumer group offset synchronization&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;ENHANCED&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;Note that &lt;code&gt;StartingPosition&lt;/code&gt; is set to &lt;code&gt;EARLIEST&lt;/code&gt; in the configuration below, which means the replicator begins reading from the oldest available offset on each topic. This is the recommended setting for migrations to avoid data loss.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;Key &lt;code&gt;log-delivery.json&lt;/code&gt; inputs:&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Field&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Description&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Example&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Enabled&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Allows you to enable CloudWatch logging&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;true&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;LogGroup&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;The name of the CloudWatch Logs log group that receives the replicator logs&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;/msk/replicator/my-replicator&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;Additional log delivery methods for &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3&lt;/a&gt; and &lt;a href="https://aws.amazon.com/firehose/" target="_blank" rel="noopener noreferrer"&gt;Amazon Data Firehose&lt;/a&gt; are supported. In this post, we use CloudWatch logging.&lt;/p&gt; 
&lt;p&gt;The configs should look like the following for external to MSK replication.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;&lt;code&gt;kafka-clusters.json:&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-json"&gt;[&amp;nbsp;
&amp;nbsp; {&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp; "ApacheKafkaCluster": {&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "ApacheKafkaClusterId": "lkc-abc123",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "BootstrapBrokerString": "broker1.example.com:9096"&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp; },&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp; "ClientAuthentication": {&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "SaslScram": {&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "Mechanism": "SHA512",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "SecretArn": "arn:aws:secretsmanager:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT_ID&amp;gt;:secret:my-creds"&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp; },&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp; "EncryptionInTransit": {&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "EncryptionType": "TLS",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "RootCaCertificate": "arn:aws:secretsmanager:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT_ID&amp;gt;:secret:my-cert"&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp; }&amp;nbsp;
&amp;nbsp; },&amp;nbsp;
&amp;nbsp; {&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp; "AmazonMskCluster": {&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "MskClusterArn": "arn:aws:kafka:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT_ID&amp;gt;:cluster/my-cluster/abc-123"&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp; },&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp; "VpcConfig": {&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "SecurityGroupIds": ["sg-0123456789abcdef0"],&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "SubnetIds": ["subnet-abc123", "subnet-abc124", "subnet-abc125"]&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp; }&amp;nbsp;
&amp;nbsp; }&amp;nbsp;
]&amp;nbsp;&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;&lt;code&gt;replication-info.json:&amp;nbsp;&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-json"&gt;[&amp;nbsp;
&amp;nbsp; {&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp; "SourceKafkaClusterId": "lkc-abc123",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp; "TargetKafkaClusterArn": "arn:aws:kafka:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT_ID&amp;gt;:cluster/my-cluster/abc-123",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp; "TargetCompressionType": "LZ4",&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp; "TopicReplication": {&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "TopicsToReplicate": ["my-topic"],&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "CopyTopicConfigurations": true,&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "CopyAccessControlListsForTopics": true,&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "DetectAndCopyNewTopics": true,&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "StartingPosition": {"Type": "EARLIEST"},&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "TopicNameConfiguration": {"Type": "IDENTICAL"}&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp; },&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp; "ConsumerGroupReplication": {&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "ConsumerGroupsToReplicate": ["my-group"],&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "SynchroniseConsumerGroupOffsets": true,&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "DetectAndCopyNewConsumerGroups": true,&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "ConsumerGroupOffsetSyncMode": "ENHANCED"&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp; }&amp;nbsp;
&amp;nbsp; }&amp;nbsp;
]&amp;nbsp;&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;strong&gt;&lt;code&gt;log-delivery.json:&amp;nbsp;&lt;/code&gt;&lt;/strong&gt;&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-json"&gt;{&amp;nbsp;
&amp;nbsp; "ReplicatorLogDelivery": {
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "CloudWatchLogs": {
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "Enabled": true,&amp;nbsp;
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; "LogGroup": "&amp;lt;LOG_GROUP_NAME&amp;gt;"
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp; }
&amp;nbsp; }&amp;nbsp;
}&lt;/code&gt;&lt;/pre&gt; 
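&lt;p&gt;Because the cluster ID and cluster ARN appear in more than one file, a copy-and-paste mismatch is easy to introduce. The following Python sketch cross-checks the parsed contents of &lt;code&gt;kafka-clusters.json&lt;/code&gt; and &lt;code&gt;replication-info.json&lt;/code&gt; before you call the CLI. The &lt;code&gt;find_mismatches&lt;/code&gt; helper is hypothetical and checks only the fields shown in this post.&lt;/p&gt;

```python
def find_mismatches(kafka_clusters, replication_info):
    """Return a list of problems; an empty list means the files agree."""
    external_ids = set()
    msk_arns = set()
    for cluster in kafka_clusters:
        if "ApacheKafkaCluster" in cluster:
            external_ids.add(cluster["ApacheKafkaCluster"]["ApacheKafkaClusterId"])
        if "AmazonMskCluster" in cluster:
            msk_arns.add(cluster["AmazonMskCluster"]["MskClusterArn"])

    problems = []
    for entry in replication_info:
        source_id = entry.get("SourceKafkaClusterId")
        target_arn = entry.get("TargetKafkaClusterArn")
        if source_id is not None and source_id not in external_ids:
            problems.append("unknown SourceKafkaClusterId: " + source_id)
        if target_arn is not None and target_arn not in msk_arns:
            problems.append("unknown TargetKafkaClusterArn: " + target_arn)
    return problems
```

&lt;p&gt;Load each file with &lt;code&gt;json.load&lt;/code&gt;, pass the two parsed lists to the function, and fix any reported value before creating the replicator.&lt;/p&gt;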
&lt;h4&gt;Configure bidirectional replication from MSK to the external cluster&lt;/h4&gt; 
&lt;p&gt;To enable bidirectional replication, create a second replicator that replicates in the opposite direction. Use the same IAM role and network configuration from Step 4, but swap the source and target. Replace &lt;code&gt;SourceKafkaClusterId&lt;/code&gt; with &lt;code&gt;TargetKafkaClusterId&lt;/code&gt; and &lt;code&gt;TargetKafkaClusterArn&lt;/code&gt; with &lt;code&gt;SourceKafkaClusterArn&lt;/code&gt; in a new &lt;code&gt;msk-to-external-replication-info.json&lt;/code&gt; file:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-bash"&gt;aws kafka create-replicator \
  --replicator-name msk-to-external \
  --service-execution-role-arn "arn:aws:iam::123456789012:role/MSKReplicatorRole" \
  --kafka-clusters file://./kafka-clusters.json \
  --replication-info-list file://./msk-to-external-replication-info.json \
  --log-delivery file://./log-delivery.json \
  --region us-east-1&lt;/code&gt;&lt;/pre&gt; 
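&lt;p&gt;If you prefer to derive the reverse file programmatically, the key swap described above can be sketched in Python. This is a minimal illustration, not part of the MSK tooling: the function name &lt;code&gt;reverse_replication_info&lt;/code&gt; and the sample cluster ID are hypothetical, and the sketch only renames the two cluster keys while leaving every other setting untouched.&lt;/p&gt;

```python
import json

# Key renames described in the text: the external cluster becomes the target
# and the MSK cluster becomes the source.
KEY_SWAPS = {
    "SourceKafkaClusterId": "TargetKafkaClusterId",
    "TargetKafkaClusterArn": "SourceKafkaClusterArn",
}


def reverse_replication_info(entries):
    """Return a copy of a replication-info list with the cluster keys swapped."""
    return [
        {KEY_SWAPS.get(key, key): value for key, value in entry.items()}
        for entry in entries
    ]


# Hypothetical forward (external-to-MSK) entry, trimmed to a few keys.
forward = [
    {
        "SourceKafkaClusterId": "example-external-cluster-id",
        "TargetKafkaClusterArn": "arn:aws:kafka:us-east-1:123456789012:cluster/my-cluster/abc-123",
        "TargetCompressionType": "LZ4",
    }
]

reverse = reverse_replication_info(forward)
print(json.dumps(reverse, indent=2))
```

&lt;p&gt;Review the remaining settings, such as the topic and consumer group lists, before you use a generated file with &lt;code&gt;create-replicator&lt;/code&gt;.&lt;/p&gt;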
&lt;h2&gt;Monitoring replication health&lt;/h2&gt; 
&lt;p&gt;Monitor your replication using Amazon CloudWatch metrics. Three key metrics to understand are &lt;code&gt;MessageLag&lt;/code&gt;, &lt;code&gt;SumOffsetLag&lt;/code&gt;, and &lt;code&gt;ReplicationLatency&lt;/code&gt;. &lt;code&gt;MessageLag&lt;/code&gt; measures how far behind the replicator is from the external cluster in terms of messages not yet replicated, while &lt;code&gt;SumOffsetLag&lt;/code&gt; measures how far behind a consumer group is from the latest message in a topic. &lt;code&gt;ReplicationLatency&lt;/code&gt; is the amount of latency between the source and target clusters in data replication. When the three reach a sustained low level, your clusters are fully synchronized for both data and consumer group offsets.&lt;/p&gt; 
&lt;p&gt;To troubleshoot MSK Replicator errors, use the CloudWatch logs to get more details about the health of the replicator. MSK Replicator logs status and troubleshooting information that can help you diagnose issues such as connectivity, authentication, and SSL errors.&lt;/p&gt; 
&lt;p&gt;Note that the replication is asynchronous, so there will be some lag during replication. The lag will reach zero once a client is shut down during migration to the target cluster. This takes about 30 seconds under normal operations, allowing a low-downtime migration without data loss. If your lag is continually increasing or does not reach a sustained low level, this usually indicates that you have insufficient partitions for high-throughput replication. Refer to &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/msk-replicator-troubleshooting.html"&gt;Troubleshoot MSK Replicator&lt;/a&gt; for more information on troubleshooting replication throughput and lag.&lt;/p&gt; 
&lt;p&gt;Key metrics include:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;MessageLag –&lt;/strong&gt; Monitors the sync between the MSK Replicator and the source cluster. MessageLag indicates the lag between the messages produced to the source cluster and messages consumed by the replicator. It is not the lag between the source and target cluster.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;ReplicationLatency –&lt;/strong&gt; Time taken for records to replicate from source to target cluster (ms)&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;ReplicatorThroughput –&lt;/strong&gt; Average number of bytes replicated per second&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;ReplicatorFailure –&lt;/strong&gt; Number of failures the replicator is experiencing&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;KafkaClusterPingSuccessCount –&lt;/strong&gt; Connection health indicator (1 = healthy, 0 = unhealthy)&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;ConsumerGroupCount –&lt;/strong&gt; Total consumer groups being synchronized&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;ConsumerGroupOffsetSyncFailure –&lt;/strong&gt; Failures during offset synchronization&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;AuthError –&lt;/strong&gt; Number of connections with failed authentication per second, by cluster&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;ThrottleTime –&lt;/strong&gt; Average time in ms a request was throttled by brokers, by cluster&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;SumOffsetLag –&lt;/strong&gt; Aggregated offset lag across partitions for a consumer group on a topic (MSK cluster-level metric)&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;For more details on these metrics, see the &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/msk-replicator-monitor.html" target="_blank" rel="noopener noreferrer"&gt;MSK Replicator metrics documentation&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;Your applications are ready to migrate when the following conditions are met. For most workloads, you should expect these metrics to stabilize within a few hours of starting replication. High-throughput clusters may take longer depending on topic volume and partition count.&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;ReplicatorFailure&lt;/strong&gt; = 0&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;ConsumerGroupOffsetSyncFailure&lt;/strong&gt; = 0&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;KafkaClusterPingSuccessCount&lt;/strong&gt; = 1 for both source and target clusters&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;MessageLag&lt;/strong&gt; &amp;lt; 1,000 
  &lt;ul&gt; 
   &lt;li&gt;Your sustained lag may be lower or higher depending on your throughput per partition, message size, and other factors&lt;/li&gt; 
   &lt;li&gt;Sustained high message lag usually indicates insufficient partitions for high-throughput replication&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;ReplicationLatency&lt;/strong&gt; &amp;lt; 90 seconds 
  &lt;ul&gt; 
   &lt;li&gt;Your sustained latency may be lower or higher depending on your throughput per partition, message size, and other factors&lt;/li&gt; 
   &lt;li&gt;Sustained high latency usually indicates insufficient partitions for high-throughput replication&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;SumOffsetLag&lt;/strong&gt; is at a sustained low level on both clusters 
  &lt;ul&gt; 
   &lt;li&gt;Offset values on the two clusters may not be numerically identical.&lt;/li&gt; 
   &lt;li&gt;MSK Replicator translates offsets between clusters so that consumers resume from the correct position, but the raw offset numbers can differ due to how offset translation works. What matters is that SumOffsetLag is at a sustained low level.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;ConsumerGroupCount&lt;/strong&gt; (MSK) = Expected count (external cluster) 
  &lt;ul&gt; 
   &lt;li&gt;If ConsumerGroupCount is zero or does not match the expected count, then there is an issue in the Replicator configuration or a permissions issue preventing consumer group synchronization&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ul&gt; 
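&lt;p&gt;The readiness conditions above can be expressed as a small table-driven check. The following is a sketch under stated assumptions, not an official tool: it assumes you have already sampled the latest metric values into a dictionary, the function name &lt;code&gt;failed_checks&lt;/code&gt; is hypothetical, &lt;code&gt;KafkaClusterPingSuccessCount&lt;/code&gt; is simplified to the lower of the two cluster values, and &lt;code&gt;SumOffsetLag&lt;/code&gt; is left out because the post gives no numeric threshold for it.&lt;/p&gt;

```python
from operator import eq, lt

# (metric name, comparison, threshold) taken from the readiness list above.
READINESS_CHECKS = [
    ("ReplicatorFailure", eq, 0),
    ("ConsumerGroupOffsetSyncFailure", eq, 0),
    ("KafkaClusterPingSuccessCount", eq, 1),  # lower of source and target values
    ("MessageLag", lt, 1000),
    ("ReplicationLatency", lt, 90_000),  # 90 seconds, expressed in milliseconds
]


def failed_checks(metrics, expected_consumer_groups):
    """Return the names of readiness checks that the sampled metrics do not pass."""
    failures = [
        name
        for name, compare, threshold in READINESS_CHECKS
        if name not in metrics or not compare(metrics[name], threshold)
    ]
    # The consumer group count must match what you expect from the external cluster.
    if metrics.get("ConsumerGroupCount") != expected_consumer_groups:
        failures.append("ConsumerGroupCount")
    return failures


# Hypothetical sample values for illustration only.
sample = {
    "ReplicatorFailure": 0,
    "ConsumerGroupOffsetSyncFailure": 0,
    "KafkaClusterPingSuccessCount": 1,
    "MessageLag": 120,
    "ReplicationLatency": 4500,
    "ConsumerGroupCount": 3,
}
print(failed_checks(sample, expected_consumer_groups=3))
```

&lt;p&gt;An empty list means every numeric condition passed for that sample. Because the post asks for sustained levels, evaluate several consecutive samples rather than a single reading.&lt;/p&gt;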
&lt;h2&gt;Migrating your applications&lt;/h2&gt; 
&lt;p&gt;With bidirectional consumer offset synchronization, you can migrate your producers and consumers regardless of order. Start by monitoring replication metrics until they reach the target values described in the previous section. Then migrate your applications (producers or consumers) to use the MSK Express cluster endpoints and verify that they are producing and consuming as expected. If you encounter issues, you can roll back by switching applications back to the external cluster. The consumer offset synchronization makes sure that your applications resume from their last committed position regardless of which cluster they connect to.&lt;/p&gt; 
&lt;p&gt;For a comprehensive, hands-on walkthrough of the end-to-end migration process, explore the &lt;a href="https://catalog.workshops.aws/msk-migration-lab" target="_blank" rel="noopener noreferrer"&gt;MSK Migration Workshop&lt;/a&gt;, which provides step-by-step guidance for migrating your Kafka workloads to Amazon MSK.&lt;/p&gt; 
&lt;h2&gt;Security considerations&lt;/h2&gt; 
&lt;p&gt;MSK Replicator uses SASL/SCRAM authentication with SSL encryption for secure data transfer between your external cluster and AWS. The solution supports both publicly trusted certificates and private or self-signed certificates. Credentials are stored securely in &lt;a href="https://docs.aws.amazon.com/secretsmanager/latest/userguide/intro.html" target="_blank" rel="noopener noreferrer"&gt;AWS Secrets Manager&lt;/a&gt;, and the target MSK Express cluster uses &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/iam-access-control.html" target="_blank" rel="noopener noreferrer"&gt;IAM authentication&lt;/a&gt; for access control.&lt;/p&gt; 
&lt;p&gt;When configuring security, keep the following in mind:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Make sure that the IAM role you create in Step 4 follows the principle of least privilege. Attach only &lt;code&gt;AWSMSKReplicatorExecutionRole&lt;/code&gt; and an IAM policy that grants least-privilege read access to the required Secrets Manager secret values, and avoid adding broader permissions.&lt;/li&gt; 
 &lt;li&gt;Verify that your Secrets Manager secret is encrypted with an AWS KMS key that the MSK Replicator service execution role has permission to decrypt.&lt;/li&gt; 
 &lt;li&gt;Confirm that the security groups assigned to MSK Replicator allow outbound traffic to your external cluster’s broker ports (typically 9096 for SASL/SCRAM with TLS) and to the MSK Express cluster.&lt;/li&gt; 
 &lt;li&gt;Rotate your SASL/SCRAM credentials periodically and update the corresponding Secrets Manager secret. MSK Replicator picks up the new credentials automatically on the next connection attempt.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Under the &lt;a href="https://aws.amazon.com/compliance/shared-responsibility-model/" target="_blank" rel="noopener noreferrer"&gt;AWS shared responsibility model&lt;/a&gt;, AWS is responsible for securing the underlying infrastructure that runs MSK Replicator, including the compute, storage, and networking resources. You are responsible for configuring authentication mechanisms (SASL/SCRAM), managing credentials in AWS Secrets Manager, configuring network security (security groups and VPC settings), implementing IAM policies following least privilege, and rotating credentials. For more information, see &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/security.html" target="_blank" rel="noopener noreferrer"&gt;Security in Amazon MSK&lt;/a&gt; in the Amazon MSK Developer Guide.&lt;/p&gt; 
&lt;h2&gt;Cleanup&lt;/h2&gt; 
&lt;p&gt;To avoid ongoing charges, delete the resources you created during this walkthrough. Delete the replicators first, because they depend on the other resources:&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;aws kafka delete-replicator --replicator-arn &amp;lt;replicator-arn&amp;gt;&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;After both replicators are deleted, you can remove the following resources if they were created solely for this walkthrough:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;The MSK Express cluster (deleting a cluster also removes its stored data, so verify that your applications have fully migrated before proceeding)&lt;/li&gt; 
 &lt;li&gt;The Secrets Manager secrets containing your SASL/SCRAM credentials and certificates&lt;/li&gt; 
 &lt;li&gt;The IAM role and policies created for MSK Replicator&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;You can verify that a replicator has been fully deleted by running &lt;code&gt;aws kafka list-replicators&lt;/code&gt; and confirming it no longer appears in the output.&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;Amazon MSK Replicator simplifies migrating to Amazon MSK Express brokers and establishing hybrid Kafka architectures. The fully managed service removes the operational complexity of managing replication, while bidirectional consumer offset synchronization enables flexible, low-risk application migration.&lt;/p&gt; 
&lt;h3&gt;Next Steps&lt;/h3&gt; 
&lt;p&gt;To get started using MSK Replicator to migrate applications to MSK Express brokers, use the &lt;a href="https://catalog.workshops.aws/msk-migration-lab" target="_blank" rel="noopener noreferrer"&gt;MSK Migration Workshop&lt;/a&gt; for a hands-on, end-to-end migration walkthrough. The &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/msk-replicator.html" target="_blank" rel="noopener noreferrer"&gt;Amazon MSK Replicator documentation&lt;/a&gt; provides detailed guidance to help you configure MSK Replicator for your use case. From there, use MSK Replicator to migrate your Apache Kafka workloads to MSK Express brokers.&lt;/p&gt; 
&lt;p&gt;After your migration is complete, consider exploring multi-Region replication patterns for disaster recovery, or integrating your MSK Express cluster with AWS analytics services such as &lt;a href="https://aws.amazon.com/firehose/" target="_blank" rel="noopener noreferrer"&gt;Amazon Data Firehose&lt;/a&gt; and &lt;a href="https://aws.amazon.com/athena/" target="_blank" rel="noopener noreferrer"&gt;Amazon Athena&lt;/a&gt;. If you need help planning your migration, reach out to your AWS account team, &lt;a href="https://aws.amazon.com/support/" target="_blank" rel="noopener noreferrer"&gt;AWS Support&lt;/a&gt;, or &lt;a href="https://aws.amazon.com/professional-services/" target="_blank" rel="noopener noreferrer"&gt;AWS Professional Services&lt;/a&gt;.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone wp-image-90062 size-thumbnail" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/10/ankitams-100x133.jpg" alt="" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Ankita Mishra&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/ankitamishra05" target="_blank" rel="noopener"&gt;Ankita&lt;/a&gt; is a Product Manager for Amazon Managed Streaming for Apache Kafka. She works closely with AWS customers to understand their needs for real-time analytics and high throughput, low latency streaming workloads. Working backwards from their needs, she helps drive the MSK roadmap and deliver new innovations that help AWS customers focus on building novel streaming applications.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-89475" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/03/24/bdb-5775-mmehrten-headshot.png" alt="" width="100" height="107"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Mazrim Mehrtens&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/mmehrtens/" target="_blank" rel="noopener"&gt;Mazrim&lt;/a&gt; is a Sr. Specialist Solutions Architect for messaging and streaming workloads. Mazrim works with customers to build and support systems that process and analyze terabytes of streaming data in real time, run enterprise Machine Learning pipelines, and create systems to share data across teams seamlessly with varying data toolsets and software stacks.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Building unified data pipelines with Apache Iceberg and Apache Flink</title>
		<link>https://aws.amazon.com/blogs/big-data/building-unified-data-pipelines-with-apache-iceberg-and-apache-flink/</link>
					
		
		<dc:creator><![CDATA[Nikhil Jha]]></dc:creator>
		<pubDate>Mon, 20 Apr 2026 16:59:46 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon Managed Service for Apache Flink]]></category>
		<category><![CDATA[AWS Big Data]]></category>
		<category><![CDATA[AWS Glue]]></category>
		<category><![CDATA[Financial Services]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">95a314f2dba6484ebfd5ac609fa7a195b4550f34</guid>

					<description>In this post, you build a unified pipeline using Apache Iceberg and Amazon Managed Service for Apache Flink that replaces the dual-pipeline approach. This walkthrough is for intermediate AWS users who are comfortable with Amazon Simple Storage Service (Amazon S3) and AWS Glue Data Catalog but new to streaming from Apache Iceberg tables.</description>
										<content:encoded>&lt;p&gt;You can process real-time data from your data lake with &lt;a href="https://docs.aws.amazon.com/managed-flink/" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Service for Apache Flink&lt;/a&gt; without maintaining two separate pipelines. Yet many teams do exactly that, and the cost adds up fast. In this post, you build a unified pipeline using Apache Iceberg and Amazon Managed Service for Apache Flink that replaces the dual-pipeline approach. This walkthrough is for intermediate AWS users who are comfortable with &lt;a href="https://docs.aws.amazon.com/s3/" target="_blank" rel="noopener noreferrer"&gt;Amazon Simple Storage Service (Amazon S3)&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html" target="_blank" rel="noopener noreferrer"&gt;AWS Glue Data Catalog&lt;/a&gt; but new to streaming from Apache Iceberg tables.&lt;/p&gt; 
&lt;h2&gt;The dual-pipeline problem&lt;/h2&gt; 
&lt;p&gt;&lt;img loading="lazy" class="wp-image-89997 size-full aligncenter" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/07/BDB-5291-image-1-1.png" alt="Traditional dual-pipeline architecture with separate batch and streaming paths, each with its own ingestion, processing, storage, and serving layers, processing the same source data independently." width="962" height="495"&gt;&lt;/p&gt; 
&lt;p&gt;This dual-pipeline approach creates three problems:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Double the infrastructure costs.&lt;/strong&gt; You run and pay for two separate compute environments, two storage layers, and two sets of monitoring. For example, if you’re spending $10,000/month on separate streaming and batch infrastructure, a meaningful portion of that spend is pure duplication.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Data synchronization issues.&lt;/strong&gt; Your batch and streaming consumers read from different copies of the data, processed at different times. When a transaction shows up in your real-time dashboard but not in your batch report (or vice versa), debugging the inconsistency takes hours.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Operational complexity.&lt;/strong&gt; Two pipelines mean two deployment processes, two failure modes to monitor, and two sets of schema evolution to manage. Your team spends time reconciling systems instead of building features.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Where this pattern fits&lt;/h2&gt; 
&lt;p&gt;Before diving into the implementation, consider whether streaming from your data lake is the right approach for your use case.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Streaming from Apache Iceberg tables works well when&lt;/strong&gt; you need data available within seconds to minutes and you query recent data frequently, multiple times per hour. Common scenarios include:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Operational data stores&lt;/strong&gt; — Stream customer profile updates to serve downstream applications like recommendation engines. When a customer updates their preferences, those changes reach your operational data store within seconds.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Fraud detection&lt;/strong&gt; — Stream transactions for immediate analysis. Start with a 3-second monitor interval and adjust based on your detection accuracy needs.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Live dashboards&lt;/strong&gt; — Power real-time analytics directly from your lake. This is the strongest starting point if you’re evaluating the approach for the first time, because the feedback loop is immediate and straightforward to validate.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Event-driven architectures&lt;/strong&gt; — Trigger downstream processes based on data changes in your Apache Iceberg tables.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Batch processing remains more cost-effective when&lt;/strong&gt; you process data once per day or less, or you primarily query historical data. Batch queries on Apache Iceberg tables cost less because they don’t require a continuous Apache Flink runtime.&lt;/p&gt; 
&lt;h2&gt;How Apache Iceberg solves this&lt;/h2&gt; 
&lt;p&gt;Apache Iceberg’s snapshot-based architecture removes the need for a separate streaming pipeline. Think of snapshots like Git commits for your data. Each time you write data to your Iceberg table, Iceberg creates a new snapshot that points to the new data files while preserving references to existing files. Apache Flink reads only the changes between snapshots (the new files that arrived after the last checkpoint), rather than scanning the entire table. Atomicity, Consistency, Isolation, Durability (ACID) transactions prevent your concurrent reads and writes from producing partial or inconsistent results. For example, if your batch extract, transform, and load (ETL) job is writing 10,000 records while your Flink application is reading, ACID transactions mean that your streaming query sees either the complete batch of 10,000 records or none of them, not a partial set that could skew your analytics.&lt;/p&gt; 
&lt;p&gt;The result is a single pipeline that handles both real-time and batch access from the same data, through the same storage layer, with the same schema.&lt;/p&gt; 
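&lt;p&gt;To make the snapshot idea concrete, here is a toy in-memory model of incremental reads. It is an illustration only and does not use the Apache Iceberg or Apache Flink APIs: the class names &lt;code&gt;ToyTable&lt;/code&gt; and &lt;code&gt;IncrementalReader&lt;/code&gt; are hypothetical, a snapshot is reduced to the list of files that one commit added, and the reader's &lt;code&gt;last_snapshot_id&lt;/code&gt; stands in for the position that Flink keeps in its checkpoint.&lt;/p&gt;

```python
class ToyTable:
    """A table whose history is a list of snapshots, one per commit."""

    def __init__(self):
        self.snapshots = []  # each entry: tuple of files added by that commit

    def commit(self, new_files):
        """Publish a whole batch of files as one snapshot and return its ID."""
        self.snapshots.append(tuple(new_files))
        return len(self.snapshots)

    def files_added_after(self, snapshot_id):
        """Return files added by snapshots newer than snapshot_id."""
        return [name for added in self.snapshots[snapshot_id:] for name in added]


class IncrementalReader:
    """Reads only the files that arrived since the previous poll."""

    def __init__(self, table):
        self.table = table
        self.last_snapshot_id = 0  # stand-in for checkpointed reader state

    def poll(self):
        new_files = self.table.files_added_after(self.last_snapshot_id)
        self.last_snapshot_id = len(self.table.snapshots)
        return new_files


table = ToyTable()
reader = IncrementalReader(table)

table.commit(["batch-1-a.parquet", "batch-1-b.parquet"])
first = reader.poll()   # both files from the first commit
second = reader.poll()  # nothing new, so an empty list

table.commit(["batch-2-a.parquet"])
third = reader.poll()   # only the file from the second commit
print(first, second, third)
```

&lt;p&gt;Because a batch becomes visible only when its commit is appended, the reader in this model sees either all files of a batch or none of them, which mirrors the all-or-nothing behavior described above.&lt;/p&gt;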
&lt;h2&gt;Solution architecture&lt;/h2&gt; 
&lt;p&gt;Your architecture uses four AWS services and one open source table format working together. The following diagram shows how these components connect, replacing the dual-pipeline pattern shown earlier with a single unified flow.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="size-full wp-image-89963 aligncenter" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/07/BDB-5291-image-2.png" alt="Unified pipeline architecture with data flowing from Amazon S3 through Apache Iceberg tables, with AWS Glue Data Catalog managing metadata, and Amazon Managed Service for Apache Flink consuming incremental snapshots for near real-time processing." width="1101" height="581"&gt;&lt;/p&gt; 
&lt;p&gt;Your source data lands in Amazon S3 as Apache Iceberg table files. AWS Glue Data Catalog tracks the metadata and schema. When new data arrives, Apache Iceberg creates a new snapshot that your application detects. Your Flink application monitors these snapshots and processes new records incrementally, reading only the files that arrived after the last checkpoint, not the entire table.&lt;/p&gt; 
&lt;p&gt;You use four main components:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon S3&lt;/strong&gt; — Foundational storage layer for your data lake&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Data Catalog&lt;/strong&gt; — Metadata and schema management for Apache Iceberg tables&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Apache Iceberg&lt;/strong&gt; — Table format with snapshot-based streaming capabilities&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon Managed Service for Apache Flink&lt;/strong&gt; — Stream processing and incremental consumption&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Important notices&lt;/h2&gt; 
&lt;p&gt;Before implementing this solution, evaluate these risks for your environment:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Data security:&lt;/strong&gt; Streaming from data lakes exposes data to additional processing systems. Classify your data before implementation. Customer profile updates and transaction data typically contain personally identifiable information (PII), so treat them as confidential. Apply encryption at rest and in transit for confidential data. Key risks include unauthorized data access through misconfigured Amazon S3 bucket policies or overly permissive IAM roles. Mitigations: use the resource-scoped IAM policy and TLS-enforcing bucket policy provided in the Security section.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Data integrity:&lt;/strong&gt; Misconfigured checkpoints or schema changes during streaming can lead to data inconsistency. Mitigations: enable exactly-once processing semantics and test schema evolution in a non-production environment first.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Compliance:&lt;/strong&gt; Verify that real-time data processing meets your regulatory requirements. For workloads subject to HIPAA, confirm that you use HIPAA Eligible Services and have a Business Associate Agreement (BAA) with AWS. For PCI-DSS or GDPR workloads, review the relevant compliance documentation on the AWS Compliance page. Implement data retention policies that comply with your regulatory framework.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Cost:&lt;/strong&gt; Nearly continuous streaming incurs ongoing compute costs. Monitor usage to avoid unexpected charges. Cost estimates in this post are based on pricing as of March 2026 and might change. Verify current pricing on the relevant AWS service pricing pages.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Operational:&lt;/strong&gt; Pipeline failures might impact downstream systems. Implement monitoring and alerting before running in production.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;Before you begin, make sure that you have the following in place. This walkthrough assumes intermediate Python skills (comfortable with functions, error handling, and environment variables), basic Apache Flink concepts (streaming compared to batch processing), and basic &lt;a href="https://docs.aws.amazon.com/iam/" target="_blank" rel="noopener noreferrer"&gt;AWS Identity and Access Management (AWS IAM)&lt;/a&gt; knowledge (creating roles and attaching policies). Plan for approximately 90–120 minutes, including setup, implementation, and testing. First-time setup might take longer as you download dependencies and configure AWS resources. Expected AWS costs: approximately $5–10 if you complete the walkthrough within 2 hours and clean up resources immediately afterward. The primary cost driver is Amazon Managed Service for Apache Flink runtime ($0.11/hour per Kinesis Processing Unit (KPU)). You can minimize costs by stopping your application when not in use.&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;An AWS account with AWS IAM permissions for: &lt;code&gt;s3:GetObject&lt;/code&gt;, &lt;code&gt;s3:PutObject&lt;/code&gt;, &lt;code&gt;s3:ListBucket&lt;/code&gt; on your data bucket; &lt;code&gt;glue:GetDatabase&lt;/code&gt;, &lt;code&gt;glue:GetTable&lt;/code&gt; for catalog access; and &lt;code&gt;kinesisanalytics:CreateApplication&lt;/code&gt;, &lt;code&gt;kinesisanalytics:StartApplication&lt;/code&gt; for Amazon Managed Service for Apache Flink&lt;/li&gt; 
 &lt;li&gt;An existing Amazon S3 bucket for your data lake&lt;/li&gt; 
 &lt;li&gt;An AWS Glue Data Catalog database configured&lt;/li&gt; 
 &lt;li&gt;Apache Flink 1.19.1 installed locally&lt;/li&gt; 
 &lt;li&gt;Python 3.8 or later&lt;/li&gt; 
 &lt;li&gt;Java 11 or a more recent version&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/cli/" target="_blank" rel="noopener noreferrer"&gt;AWS Command Line Interface (AWS CLI)&lt;/a&gt; configured with credentials (&lt;code&gt;aws configure&lt;/code&gt;)&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Required Java Archive (JAR) dependencies&lt;/h3&gt; 
&lt;p&gt;You need multiple JAR files because your Flink application coordinates between different systems—Amazon S3 for storage, AWS Glue for metadata, Hadoop for file operations, and Apache Iceberg for the table format. Each JAR handles a specific part of this integration. Missing even one causes ClassNotFoundException errors at runtime.&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;iceberg-flink-runtime-1.19-1.6.1.jar — Core Apache Iceberg integration with Apache Flink&lt;/li&gt; 
 &lt;li&gt;iceberg-aws-bundle-1.6.1.jar — AWS-specific Apache Iceberg functionality for Amazon S3 and AWS Glue&lt;/li&gt; 
 &lt;li&gt;flink-s3-fs-hadoop-1.19.1.jar — Provides Apache Flink read and write access to Amazon S3&lt;/li&gt; 
 &lt;li&gt;flink-sql-connector-hive-3.1.3_2.12-1.19.1.jar — Hive metastore connector for catalog compatibility&lt;/li&gt; 
 &lt;li&gt;hadoop-common-3.4.0.jar — Core Hadoop libraries required by Apache Iceberg&lt;/li&gt; 
 &lt;li&gt;flink-shaded-hadoop-2-uber-2.8.3-10.0.jar — Repackaged Hadoop dependencies that avoid version conflicts with Apache Flink&lt;/li&gt; 
 &lt;li&gt;hadoop-hdfs-client-3.4.0.jar — Hadoop Distributed File System (HDFS) client libraries for file system operations&lt;/li&gt; 
 &lt;li&gt;flink-json-1.19.1.jar — JSON format support for Apache Flink&lt;/li&gt; 
 &lt;li&gt;hadoop-aws-3.4.0.jar — Hadoop integration with AWS services&lt;/li&gt; 
 &lt;li&gt;hadoop-client-3.4.0.jar — Hadoop client libraries&lt;/li&gt; 
 &lt;li&gt;aws-java-sdk-bundle-1.12.261.jar — AWS SDK for authentication and service access&lt;/li&gt; 
&lt;/ul&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;jars = [
    "flink-s3-fs-hadoop-1.19.1.jar",
    "flink-sql-connector-hive-3.1.3_2.12-1.19.1.jar",
    "hadoop-common-3.4.0.jar",
    "flink-shaded-hadoop-2-uber-2.8.3-10.0.jar",
    "iceberg-flink-runtime-1.19-1.6.1.jar",
    "iceberg-aws-bundle-1.6.1.jar",
    "hadoop-hdfs-client-3.4.0.jar",
    "flink-json-1.19.1.jar",
    "hadoop-aws-3.4.0.jar",
    "hadoop-client-3.4.0.jar",
    "aws-java-sdk-bundle-1.12.261.jar"
]&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
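&lt;p&gt;Because a single missing JAR surfaces only as a runtime error, a quick pre-flight check can save a debugging round trip. The following is a minimal sketch: the function name &lt;code&gt;find_missing_jars&lt;/code&gt; is hypothetical, and it assumes the JAR files sit directly inside the directory you pass in (for example, the &lt;code&gt;lib&lt;/code&gt; directory used later in this post).&lt;/p&gt;

```python
from pathlib import Path

# File names copied from the dependency list above.
REQUIRED_JARS = [
    "flink-s3-fs-hadoop-1.19.1.jar",
    "flink-sql-connector-hive-3.1.3_2.12-1.19.1.jar",
    "hadoop-common-3.4.0.jar",
    "flink-shaded-hadoop-2-uber-2.8.3-10.0.jar",
    "iceberg-flink-runtime-1.19-1.6.1.jar",
    "iceberg-aws-bundle-1.6.1.jar",
    "hadoop-hdfs-client-3.4.0.jar",
    "flink-json-1.19.1.jar",
    "hadoop-aws-3.4.0.jar",
    "hadoop-client-3.4.0.jar",
    "aws-java-sdk-bundle-1.12.261.jar",
]


def find_missing_jars(lib_dir, required=REQUIRED_JARS):
    """Return the required JAR file names that are not present in lib_dir."""
    directory = Path(lib_dir)
    if not directory.is_dir():
        # Nothing can be present if the directory itself does not exist.
        return list(required)
    present = {path.name for path in directory.glob("*.jar")}
    return [name for name in required if name not in present]
```

&lt;p&gt;Calling &lt;code&gt;find_missing_jars("lib")&lt;/code&gt; before you start the application returns an empty list when every dependency is in place, and otherwise lists the file names you still need to download.&lt;/p&gt;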
&lt;h2&gt;Technical implementation&lt;/h2&gt; 
&lt;p&gt;The sample code in this post is available under the MIT-0 license. This section walks you through building the streaming pipeline step by step. You create a single Python file, &lt;code&gt;iceberg_streaming.py&lt;/code&gt;, with three functions that run in sequence. Your &lt;code&gt;main()&lt;/code&gt; function calls them in order: set up the Apache Flink environment, register the Data Catalog, then start the streaming query.&lt;/p&gt; 
&lt;h3&gt;Set up your Apache Flink environment&lt;/h3&gt; 
&lt;p&gt;To prepare your Apache Flink environment:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Download the required JAR files listed in the prerequisites section.&lt;/li&gt; 
 &lt;li&gt;Place the JAR files in a lib directory in your project folder.&lt;/li&gt; 
 &lt;li&gt;Configure your &lt;code&gt;HADOOP_CLASSPATH&lt;/code&gt; environment variable to point to the lib directory.&lt;/li&gt; 
 &lt;li&gt;Create your streaming execution environment by adding the following function to &lt;code&gt;iceberg_streaming.py&lt;/code&gt;:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
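 &lt;pre&gt;&lt;code class="lang-python"&gt;import os

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.table import EnvironmentSettings, StreamTableEnvironment


def setup_environment():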
    """Configure the Flink streaming runtime."""
    try:
        os.environ['HADOOP_CLASSPATH'] = os.path.join(os.getcwd(), 'lib', '*')
        env = StreamExecutionEnvironment.get_execution_environment()
        env.set_parallelism(1)
        settings = EnvironmentSettings.new_instance().in_streaming_mode().build()
        t_env = StreamTableEnvironment.create(env, settings)
        return t_env
    except Exception as e:
        print(f"Failed to initialize Flink environment: {e}")
        raise&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="5"&gt; 
 &lt;li&gt;Verify your environment by running &lt;code&gt;flink --version&lt;/code&gt;. If the command isn’t found, confirm that Apache Flink 1.19.1 is installed and that your PATH includes the Flink bin directory.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h3&gt;Configure AWS Glue Data Catalog&lt;/h3&gt; 
&lt;p&gt;To connect your Flink application to Data Catalog:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Open your &lt;code&gt;iceberg_streaming.py&lt;/code&gt; file.&lt;/li&gt; 
 &lt;li&gt;Add the &lt;code&gt;create_iceberg_source()&lt;/code&gt; function shown in the following section.&lt;/li&gt; 
 &lt;li&gt;Replace the placeholder values with your actual AWS resources before running. These values are static configuration strings, not user input — do not construct them from external or untrusted sources at runtime.&lt;/li&gt; 
 &lt;li&gt;Save the file.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;def create_iceberg_source(t_env):
    """Register the AWS Glue Data Catalog as an Iceberg catalog."""
    try:
        catalog_sql = """
        CREATE CATALOG glue_catalog WITH (
            'type'='iceberg',
            'catalog-impl'='org.apache.iceberg.aws.glue.GlueCatalog',
            'warehouse'='s3://&amp;lt;example-data-lake-bucket&amp;gt;',
            'io-impl'='org.apache.iceberg.aws.s3.S3FileIO',
            'aws.region'='us-east-1',
            'hadoop-conf.fs.s3a.aws.credentials.provider'=
                'com.amazonaws.auth.DefaultAWSCredentialsProviderChain',
            'hadoop-conf.fs.s3a.endpoint'='s3.amazonaws.com',
            'property-version'='1'
        )
        """
        t_env.execute_sql(catalog_sql)
        t_env.use_catalog("glue_catalog")
        t_env.use_database("streaming_db")
    except Exception as e:
        print(f"Failed to configure Iceberg catalog: {e}")
        raise&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3&gt;Set up streaming logic&lt;/h3&gt; 
&lt;p&gt;This function configures Apache Flink to monitor your Apache Iceberg table continuously and process new records as they arrive. Checkpointing runs every 10 seconds to track progress. If the job restarts, it resumes from the last checkpoint rather than reprocessing the entire table.&lt;/p&gt; 
&lt;p&gt;Notice the &lt;code&gt;monitor-interval&lt;/code&gt; parameter, which controls how frequently Apache Flink checks for new Apache Iceberg snapshots. A 3-second interval provides near real-time processing but generates approximately 1,200 Amazon S3 LIST API calls per hour, or about 864,000 per month (at $0.005 per 1,000 requests, roughly $4.32/month per table based on pricing as of March 2026). For less time-sensitive workloads, increase this to &lt;code&gt;30s&lt;/code&gt; to reduce API costs by 90%.&lt;/p&gt; 
&lt;p&gt;Replace &lt;code&gt;customer_events&lt;/code&gt; with the name of your Apache Iceberg table in Data Catalog:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;def process_record(row):
    """Validate and process each record from the stream."""
    try:
        if row is None:
            raise ValueError("Received null row")
        required_fields = ["event_type", "timestamp"]
        for field in required_fields:
            if field not in row:
                raise ValueError(f"Missing required field: {field}")
        # Validate field types and content
        if not isinstance(row.get("event_type"), str) or len(row["event_type"]) &amp;gt; 256:
            raise ValueError("event_type must be a string under 256 characters")
        if not isinstance(row.get("timestamp"), (str, int)):
            raise ValueError("timestamp must be a string or integer")
        # Replace with your business logic
        print(f"Processing record: {row}")
    except ValueError as e:
        print(f"Validation error for record {row}: {e}")
    except Exception as e:
        print(f"Error processing record {row}: {e}")


def stream_data(t_env):
    """Start the streaming query and process results."""
    try:
        configuration = t_env.get_config().get_configuration()
        configuration.set_string("table.dynamic-table-options.enabled", "true")
        configuration.set_string("execution.checkpointing.interval", "10000")
        query = """
        SELECT * FROM customer_events /*+ OPTIONS(
            'streaming'='true',
            'monitor-interval'='3s',
            'table.exec.iceberg.cell-based-snapshot'='true'
        ) */
        """
        table_result = t_env.execute_sql(query)
        with table_result.collect() as results:
            for row in results:
                process_record(row)
    except Exception as e:
        print(f"Streaming query failed: {e}")
        raise&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
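&lt;p&gt;To compare monitor intervals before you choose one, you can work through the request arithmetic in a few lines. This is a rough sketch that assumes one LIST request per monitoring check and a 720-hour month. The function names are hypothetical, and the default price is the $0.005 per 1,000 requests figure quoted earlier.&lt;/p&gt;

```python
def list_calls_per_hour(monitor_interval_seconds):
    """Assume each monitoring check issues one LIST request."""
    return 3600 / monitor_interval_seconds


def monthly_list_cost(monitor_interval_seconds, price_per_1000=0.005,
                      hours_per_month=720):
    """Estimate the monthly LIST request cost in USD for one table."""
    calls = list_calls_per_hour(monitor_interval_seconds) * hours_per_month
    return calls / 1000 * price_per_1000


def reduction_percent(old_interval_seconds, new_interval_seconds):
    """Percentage drop in LIST calls when moving to a longer interval."""
    old_calls = list_calls_per_hour(old_interval_seconds)
    new_calls = list_calls_per_hour(new_interval_seconds)
    return (1 - new_calls / old_calls) * 100


for interval in (3, 30):
    print(f"{interval}s interval: {list_calls_per_hour(interval):.0f} calls/hour")
print(f"Reduction from 3s to 30s: {reduction_percent(3, 30):.0f}%")
```

&lt;p&gt;Actual request counts depend on how many metadata files each check reads, so treat the result as an order-of-magnitude estimate.&lt;/p&gt;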
&lt;h3&gt;Putting it together&lt;/h3&gt; 
&lt;p&gt;Your &lt;code&gt;main()&lt;/code&gt; function calls the three steps in order:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;def main():
    try:
        t_env = setup_environment()
        create_iceberg_source(t_env)
        stream_data(t_env)
    except Exception as e:
        print(f"Pipeline failed: {e}")
        raise


if __name__ == "__main__":
    main()&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Run the pipeline locally:&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;python iceberg_streaming.py&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;Then package the application and submit it to Amazon Managed Service for Apache Flink using the console or the AWS Command Line Interface (AWS CLI).&lt;/p&gt; 
&lt;h2&gt;Running in production&lt;/h2&gt; 
&lt;p&gt;Moving from a local test to a production deployment requires tuning four areas: performance, monitoring, cost, and security. This section covers the key decisions for each.&lt;/p&gt; 
&lt;h3&gt;Performance tuning&lt;/h3&gt; 
&lt;p&gt;Determine your latency requirements before tuning. For fraud detection, you need subsecond processing. For daily reporting dashboards, you can tolerate minutes of delay.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Partition pruning&lt;/strong&gt; reduces the amount of data scanned per query. Proper partitioning can significantly reduce query times for time series data partitioned by date. To implement this, create your Apache Iceberg table with partition columns (&lt;code&gt;PARTITIONED BY (date_column)&lt;/code&gt; in your &lt;code&gt;CREATE TABLE&lt;/code&gt; statement), then include partition filters in your &lt;code&gt;WHERE&lt;/code&gt; clause: &lt;code&gt;WHERE date_column &amp;gt;= CURRENT_DATE - INTERVAL '7' DAY&lt;/code&gt;.&lt;/p&gt; 
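&lt;p&gt;The following sketch shows how the partitioned table definition and a partition-filtered query might look as SQL strings that you pass to &lt;code&gt;t_env.execute_sql()&lt;/code&gt;. The table name &lt;code&gt;customer_events_by_date&lt;/code&gt;, its columns, and the helper &lt;code&gt;recent_events_query&lt;/code&gt; are hypothetical, so adjust them to your schema.&lt;/p&gt;

```python
# Hypothetical table and column names for illustration only.
CREATE_TABLE_SQL = """
CREATE TABLE IF NOT EXISTS customer_events_by_date (
    event_type STRING,
    event_time TIMESTAMP(6),
    date_column DATE
) PARTITIONED BY (date_column)
"""


def recent_events_query(days):
    """Build a query whose WHERE clause filters on the partition column."""
    days = int(days)
    # Keep the literal within two digits, the default day precision
    # for INTERVAL literals in Flink SQL.
    if days not in range(1, 100):
        raise ValueError("days must be between 1 and 99")
    return (
        "SELECT event_type, event_time FROM customer_events_by_date "
        f"WHERE date_column >= CURRENT_DATE - INTERVAL '{days}' DAY"
    )


print(recent_events_query(7))
```

&lt;p&gt;Because the filter references the partition column directly, the engine can skip partitions outside the date range instead of scanning every data file.&lt;/p&gt;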
&lt;p&gt;&lt;strong&gt;Parallel processing&lt;/strong&gt; should match your data volume and throughput requirements. For most workloads under 10,000 records per second, a parallelism of 1–4 is sufficient. Scale up incrementally and monitor backpressure metrics (indicators that data arrives faster than your pipeline processes it, causing queuing) to find the right setting.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Checkpoint tuning&lt;/strong&gt; balances reliability and latency. Consider how much data you can afford to reprocess after a failure. If you process 1,000 records per second with 10-second checkpoints, a failure means reprocessing up to 10,000 records. When that’s acceptable, 10 seconds works well. For faster recovery or higher volumes, reduce to 5 seconds.&lt;/p&gt; 
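&lt;p&gt;The reprocessing estimate is a single multiplication, which makes it easy to compare candidate intervals. This is a small sketch, and the function name is hypothetical.&lt;/p&gt;

```python
def max_records_to_reprocess(records_per_second, checkpoint_interval_seconds):
    """Worst case: a failure just before the next checkpoint completes."""
    return records_per_second * checkpoint_interval_seconds


for interval_seconds in (10, 5):
    worst_case = max_records_to_reprocess(1000, interval_seconds)
    print(f"{interval_seconds}s interval: up to {worst_case} records reprocessed")
```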
&lt;p&gt;&lt;strong&gt;Resource allocation&lt;/strong&gt; — Right-size your Apache Flink cluster to avoid over-provisioning. Monitor CPU and memory utilization during your initial runs and adjust task manager resources accordingly.&lt;/p&gt; 
&lt;h3&gt;Monitoring&lt;/h3&gt; 
&lt;p&gt;Configure your production deployment with the following checkpoint settings. These work well for moderate data volumes (up to 10,000 records per second), providing exactly-once processing semantics. This means that the pipeline processes each record exactly once, even if your application restarts. Adjust the checkpoint interval based on your latency requirements. Add this to your &lt;code&gt;setup_environment()&lt;/code&gt; function after creating the table environment.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-css"&gt;config_dict = {
    "execution.checkpointing.interval": "30000",
    "execution.checkpointing.mode": "EXACTLY_ONCE",
    "execution.checkpointing.timeout": "600000",
    "state.backend": "filesystem",
    "state.checkpoints.dir": "s3://&amp;lt;example-data-lake-bucket&amp;gt;/checkpoints"
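}
# Apply each setting to the table environment configuration
configuration = t_env.get_config().get_configuration()
for key, value in config_dict.items():
    configuration.set_string(key, value)&lt;/code&gt;&lt;/pre&gt; 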
&lt;/div&gt; 
&lt;p&gt;Use &lt;a href="https://docs.aws.amazon.com/cloudwatch/" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch&lt;/a&gt; to track checkpoint duration, records processed per second, and backpressure metrics. A 10-second checkpoint interval means writing state to Amazon S3 360 times per hour. For a 1 MB state size, that’s approximately 8.6 GB written per day. Amazon S3 Standard storage is billed at $0.023 per GB-month, so if every checkpoint were retained for a month (about 259 GB), storage would cost roughly $6/month per application based on current pricing. Apache Flink retains only the most recent completed checkpoint by default, so actual storage is typically lower. If the checkpoint duration exceeds 50% of your interval, increase the interval or add parallelism.&lt;/p&gt; 
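&lt;p&gt;The checkpoint arithmetic and the 50% rule of thumb can be expressed as small helpers that you run against values read from your metrics. This is a sketch with hypothetical function names. It assumes 1 GB equals 1,000 MB, which matches the 8.6 GB per day figure.&lt;/p&gt;

```python
def checkpoints_per_hour(interval_seconds):
    """Number of checkpoints triggered in one hour."""
    return 3600 / interval_seconds


def daily_checkpoint_write_gb(interval_seconds, state_size_mb):
    """Data written to checkpoint storage per day, in GB (1 GB = 1,000 MB)."""
    return checkpoints_per_hour(interval_seconds) * 24 * state_size_mb / 1000


def checkpoint_is_too_slow(duration_ms, interval_ms, threshold=0.5):
    """True when checkpoint duration exceeds the given share of the interval."""
    return duration_ms > threshold * interval_ms


print(checkpoints_per_hour(10))
print(daily_checkpoint_write_gb(10, state_size_mb=1))
print(checkpoint_is_too_slow(duration_ms=6000, interval_ms=10000))
```

&lt;p&gt;When &lt;code&gt;checkpoint_is_too_slow&lt;/code&gt; returns &lt;code&gt;True&lt;/code&gt; for sustained periods, that is the signal to lengthen the interval or add parallelism.&lt;/p&gt;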
&lt;h3&gt;Cost management&lt;/h3&gt; 
&lt;p&gt;Use &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/intelligent-tiering.html" target="_blank" rel="noopener noreferrer"&gt;Amazon S3 Intelligent-Tiering&lt;/a&gt; for your Apache Iceberg data files, which typically have predictable access patterns after initial processing. Configure Apache Iceberg snapshot expiration to automatically clean up old snapshots. This can reduce storage costs by an estimated 20–30%, though your results will vary depending on write frequency and retention policies.&lt;/p&gt; 
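&lt;p&gt;One way to express a snapshot retention policy is through Apache Iceberg table properties. The following sketch builds an &lt;code&gt;ALTER TABLE&lt;/code&gt; statement that sets &lt;code&gt;history.expire.max-snapshot-age-ms&lt;/code&gt; and &lt;code&gt;history.expire.min-snapshots-to-keep&lt;/code&gt;. The helper name and the seven-day value are assumptions for illustration. These properties only set the defaults that a snapshot expiration job uses. They do not schedule the job, so you still need a maintenance process that runs it.&lt;/p&gt;

```python
def snapshot_retention_sql(table_name, max_age_days, min_snapshots_to_keep=1):
    """Build an ALTER TABLE statement that sets Iceberg retention properties."""
    max_age_ms = int(max_age_days) * 24 * 60 * 60 * 1000
    return (
        f"ALTER TABLE {table_name} SET ("
        f"'history.expire.max-snapshot-age-ms'='{max_age_ms}', "
        f"'history.expire.min-snapshots-to-keep'='{int(min_snapshots_to_keep)}')"
    )


# table_name is a static configuration value, not user input.
print(snapshot_retention_sql("customer_events", max_age_days=7))
```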
&lt;p&gt;Right-size your Apache Flink resources based on actual throughput needs. Start with a minimal configuration and scale up based on observed backpressure and checkpoint duration metrics. Use &lt;a href="https://docs.aws.amazon.com/ec2/" target="_blank" rel="noopener noreferrer"&gt;Amazon Elastic Compute Cloud (Amazon EC2)&lt;/a&gt; Spot Instances where workload interruptions are acceptable, for example, in development and testing environments.&lt;/p&gt; 
&lt;p&gt;Set data retention policies on both your Apache Iceberg tables and checkpoint storage to avoid storing data longer than necessary.&lt;/p&gt; 
&lt;h3&gt;Security&lt;/h3&gt; 
&lt;p&gt;Security is a &lt;a href="https://aws.amazon.com/compliance/shared-responsibility-model/" target="_blank" rel="noopener noreferrer"&gt;shared responsibility&lt;/a&gt; between you and AWS. AWS is responsible for the security of the cloud, including the hardware, software, networking, and facilities that run AWS services. You are responsible for security in the cloud, which includes configuring access controls, encrypting data, and managing your application security. Apply these controls in priority order.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;AWS IAM roles&lt;/strong&gt; — Use AWS IAM roles with least-privilege access, scoped to specific resources. The following example policy restricts permissions to your data lake bucket and AWS Glue catalog:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-css"&gt;{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::&amp;lt;example-data-lake-bucket&amp;gt;/*"
    },
    {
      "Effect": "Allow",
      "Action": "s3:ListBucket",
      "Resource": "arn:aws:s3:::&amp;lt;example-data-lake-bucket&amp;gt;",
      "Condition": {
        "StringEquals": {
          "aws:SourceVpce": "&amp;lt;your-vpc-endpoint-id&amp;gt;"
        }
      }
    },
    {
      "Effect": "Allow",
      "Action": ["glue:GetDatabase", "glue:GetTable"],
      "Resource": [
        "arn:aws:glue:us-east-1:&amp;lt;account-id&amp;gt;:catalog",
        "arn:aws:glue:us-east-1:&amp;lt;account-id&amp;gt;:database/streaming_db",
        "arn:aws:glue:us-east-1:&amp;lt;account-id&amp;gt;:table/streaming_db/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["kms:Decrypt", "kms:GenerateDataKey"],
      "Resource": "arn:aws:kms:us-east-1:&amp;lt;account-id&amp;gt;:key/&amp;lt;your-kms-key-id&amp;gt;"
    }
  ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Scoping permissions to specific Amazon S3 buckets, AWS Glue databases, and AWS Key Management Service (AWS KMS) keys restricts access to only the resources your pipeline requires. Review IAM policies quarterly using the &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/what-is-access-analyzer.html" target="_blank" rel="noopener noreferrer"&gt;IAM Access Analyzer&lt;/a&gt; to identify and remove unused permissions.&lt;/p&gt; 
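&lt;p&gt;Between formal reviews, a lightweight local check can flag statements that drift away from least privilege. The following is a minimal sketch, not a replacement for IAM Access Analyzer. The function &lt;code&gt;find_broad_statements&lt;/code&gt; is hypothetical and only looks for a bare &lt;code&gt;*&lt;/code&gt; in the &lt;code&gt;Action&lt;/code&gt; or &lt;code&gt;Resource&lt;/code&gt; of &lt;code&gt;Allow&lt;/code&gt; statements.&lt;/p&gt;

```python
import json


def find_broad_statements(policy):
    """Return (statement index, key) pairs where an Allow statement uses '*'."""
    findings = []
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):
        statements = [statements]
    for index, statement in enumerate(statements):
        if statement.get("Effect") != "Allow":
            continue
        for key in ("Action", "Resource"):
            values = statement.get(key, [])
            if isinstance(values, str):
                values = [values]
            if "*" in values:
                findings.append((index, key))
    return findings


# Example policy with one scoped and one overly broad statement.
example_policy = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {"Effect": "Allow", "Action": "s3:GetObject",
     "Resource": "arn:aws:s3:::example-data-lake-bucket/*"},
    {"Effect": "Allow", "Action": "glue:GetTable", "Resource": "*"}
  ]
}
""")
print(find_broad_statements(example_policy))
```

&lt;p&gt;The check treats a path wildcard such as &lt;code&gt;bucket/*&lt;/code&gt; as acceptable because it is still scoped to one bucket.&lt;/p&gt;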
&lt;p&gt;&lt;strong&gt;Encryption&lt;/strong&gt; — Configure server-side encryption with &lt;a href="https://docs.aws.amazon.com/kms/" target="_blank" rel="noopener noreferrer"&gt;AWS Key Management Service (AWS KMS)&lt;/a&gt; customer managed keys (SSE-KMS) for your Amazon S3 buckets. Using customer managed keys requires additional review from your security team. Confirm your key management policies, rotation procedures, and access controls before implementation. Enable automatic key rotation annually. For encryption in transit, enforce TLS by adding a bucket policy that denies non-HTTPS access:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-css"&gt;{
  "Effect": "Deny",
  "Principal": "*",
  "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
  "Resource": [
    "arn:aws:s3:::&amp;lt;example-data-lake-bucket&amp;gt;/*",
    "arn:aws:s3:::&amp;lt;example-data-lake-bucket&amp;gt;"
  ],
  "Condition": {
    "Bool": { "aws:SecureTransport": "false" }
  }
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
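&lt;p&gt;The statement above needs to sit inside a complete policy document before you can attach it to a bucket. The following sketch generates that document for a given bucket name. The function name, the &lt;code&gt;Sid&lt;/code&gt; value, and the bucket name are placeholders chosen for illustration.&lt;/p&gt;

```python
import json


def tls_only_bucket_policy(bucket_name):
    """Wrap the deny-non-HTTPS statement in a full bucket policy document."""
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyInsecureTransport",
                "Effect": "Deny",
                "Principal": "*",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
                # Object-level actions use the /* ARN; ListBucket uses the bucket ARN.
                "Resource": [f"{bucket_arn}/*", bucket_arn],
                "Condition": {"Bool": {"aws:SecureTransport": "false"}},
            }
        ],
    }


print(json.dumps(tls_only_bucket_policy("example-data-lake-bucket"), indent=2))
```

&lt;p&gt;You can save the output to a file and apply it with &lt;code&gt;aws s3api put-bucket-policy&lt;/code&gt;. If the bucket already has a policy, merge the statement into it rather than replacing the existing document.&lt;/p&gt;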
&lt;p&gt;&lt;strong&gt;Amazon S3 bucket hardening&lt;/strong&gt; — Enable Block Public Access on your buckets to prevent accidental public exposure:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;aws s3api put-public-access-block \
  --bucket &amp;lt;example-data-lake-bucket&amp;gt; \
  --public-access-block-configuration \
  BlockPublicAcls=true,IgnorePublicAcls=true,BlockPublicPolicy=true,RestrictPublicBuckets=true&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Enable versioning on buckets that store critical data and checkpoints to protect against accidental deletion. For production environments with sensitive data, consider enabling MFA Delete on versioned buckets. Enable S3 server access logging to track requests for security auditing.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://aws.amazon.com/vpc/" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;Amazon Virtual Private Cloud (Amazon VPC)&lt;/strong&gt;&lt;/a&gt; – Use &lt;a href="https://docs.aws.amazon.com/vpc/latest/privatelink/create-interface-endpoint.html" target="_blank" rel="noopener noreferrer"&gt;Amazon VPC endpoints&lt;/a&gt; for private communication between your Apache Flink cluster and AWS services, removing public internet routing by keeping traffic within the AWS network.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Access logging&lt;/strong&gt; – Enable &lt;a href="https://docs.aws.amazon.com/cloudtrail/" target="_blank" rel="noopener noreferrer"&gt;AWS CloudTrail&lt;/a&gt; data events to log Amazon S3 object-level API calls (GetObject, PutObject) and Data Catalog API calls. Store logs in a separate Amazon S3 bucket with restricted access and enable log file integrity validation. Run regular compliance checks using &lt;a href="https://docs.aws.amazon.com/config/" target="_blank" rel="noopener noreferrer"&gt;AWS Config&lt;/a&gt;.&lt;/p&gt; 
&lt;h3&gt;Operational practices&lt;/h3&gt; 
&lt;p&gt;Set up a continuous integration and continuous deployment (CI/CD) pipeline to automate deployment and testing. Use version control to track schema and code changes. With Apache Iceberg’s schema evolution support, you can add columns without rewriting existing data files. Establish rollback procedures using Apache Iceberg’s snapshot-based architecture, so you can roll back to a previous table state if a bad write corrupts your data.&lt;/p&gt; 
&lt;h2&gt;Troubleshooting&lt;/h2&gt; 
&lt;p&gt;If you run into issues during setup or execution, use the following table to diagnose common errors.&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Error&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Cause&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;&lt;strong&gt;Solution&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;ClassNotFoundException&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Missing JAR files&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Check the dependencies in your lib directory and confirm &lt;code&gt;HADOOP_CLASSPATH&lt;/code&gt; points to the correct path&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Table not found&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Database name mismatch&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Check that the database name in &lt;code&gt;t_env.use_database()&lt;/code&gt; matches the AWS Glue database where you registered your table&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Checkpoint failures&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Amazon S3 permissions&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Check that your Amazon S3 bucket policy grants &lt;code&gt;s3:PutObject&lt;/code&gt; for the checkpoint location&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;AWS credential errors&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Missing AWS IAM configuration&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Check that the AWS IAM role attached to your Apache Flink application has &lt;code&gt;glue:GetTable&lt;/code&gt;, &lt;code&gt;glue:GetDatabase&lt;/code&gt;, and &lt;code&gt;s3:GetObject&lt;/code&gt; permissions on the relevant resources&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Snapshot not found&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Table modified during query&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Increase &lt;code&gt;monitor-interval&lt;/code&gt; or implement retry logic in your &lt;code&gt;process_record()&lt;/code&gt; function&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Schema mismatch&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Table schema changed between snapshots&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Review Apache Iceberg schema evolution settings and confirm backward compatibility&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
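&lt;p&gt;For transient errors such as a snapshot that is briefly unavailable, retry logic with exponential backoff is a common pattern. The following is a generic sketch, and &lt;code&gt;retry_with_backoff&lt;/code&gt; is a hypothetical helper. It retries on any exception for brevity. In practice, catch only the error types you know to be transient.&lt;/p&gt;

```python
import time


def retry_with_backoff(operation, max_attempts=3, base_delay_seconds=1.0,
                       sleep=time.sleep):
    """Call operation(), retrying with exponentially growing delays."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as error:
            if attempt == max_attempts:
                # Out of attempts: surface the last error to the caller.
                raise
            delay = base_delay_seconds * 2 ** (attempt - 1)
            print(f"Attempt {attempt} failed ({error}); retrying in {delay}s")
            sleep(delay)
```

&lt;p&gt;You might wrap a call as &lt;code&gt;retry_with_backoff(lambda: process_record(row))&lt;/code&gt;. The &lt;code&gt;sleep&lt;/code&gt; parameter exists so that tests can substitute a stub instead of waiting.&lt;/p&gt;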
&lt;h2&gt;Clean up&lt;/h2&gt; 
&lt;p&gt;To avoid ongoing charges, delete the resources that you created during this walkthrough.&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Stop your Amazon Managed Service for Apache Flink application. Open the &lt;a href="https://console.aws.amazon.com/flink/" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Service for Apache Flink console&lt;/a&gt;, choose your application name, choose &lt;strong&gt;Stop&lt;/strong&gt;, and confirm the action. Or use the AWS CLI:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;code&gt;aws kinesisanalyticsv2 stop-application --application-name your-app-name&lt;/code&gt;&lt;/p&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;Delete the Amazon S3 buckets that you created for data storage and checkpoints. For instructions, see &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html" target="_blank" rel="noopener noreferrer"&gt;Deleting a bucket&lt;/a&gt; in the Amazon S3 User Guide.&lt;/li&gt; 
 &lt;li&gt;Remove the Apache Iceberg tables from your &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/console-tables.html" target="_blank" rel="noopener noreferrer"&gt;Data Catalog&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;Delete the &lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_manage_delete.html" target="_blank" rel="noopener noreferrer"&gt;AWS IAM roles and policies&lt;/a&gt; created specifically for this walkthrough.&lt;/li&gt; 
 &lt;li&gt;If you created an Amazon VPC or Amazon VPC endpoints for testing, &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/delete-vpc.html" target="_blank" rel="noopener noreferrer"&gt;delete those resources&lt;/a&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;Maintaining separate streaming and batch pipelines doubles your infrastructure costs, creates data synchronization issues, and adds operational complexity that slows your team down. In this post, you replaced that dual-pipeline architecture with a single system built on Apache Iceberg and Amazon Managed Service for Apache Flink. You configured a Flink environment with the required JAR dependencies, connected it to Data Catalog, and implemented streaming queries that read new records incrementally with exactly-once processing semantics. The same data, the same storage layer, the same schema—accessible to both your real-time and batch consumers.&lt;/p&gt; 
&lt;p&gt;To extend this solution, try these next steps based on your use case:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;If you’re processing high volumes (&amp;gt;10,000 records/sec):&lt;/strong&gt; Start with partition pruning. Add &lt;code&gt;PARTITIONED BY (date_column)&lt;/code&gt; to your table definition, which can reduce query times by an estimated 60–80% for date-filtered queries.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;If you need production monitoring:&lt;/strong&gt; Implement custom &lt;a href="https://docs.aws.amazon.com/cloudwatch/" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch&lt;/a&gt; metrics. Track checkpoint duration, records processed per second, and backpressure to catch issues before they impact your pipeline.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;If you have variable workloads:&lt;/strong&gt; Configure auto scaling for your Apache Flink cluster. See the &lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/what-is.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Service for Apache Flink Developer Guide&lt;/a&gt; for detailed guidance.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Share your implementation experience in the comments. Your use case, data volumes, latency improvements, and cost reductions help other readers calibrate their expectations. To get started, try the &lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/what-is.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Service for Apache Flink Developer Guide&lt;/a&gt; and the &lt;a href="https://iceberg.apache.org/docs/latest/" target="_blank" rel="noopener noreferrer"&gt;Apache Iceberg documentation&lt;/a&gt; on the Apache Iceberg website.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/07/BDB-5291-image-3.jpeg" alt="Headshot of Nikhil" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Nikhil Jha&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Nikhil Jha&lt;/strong&gt;&amp;nbsp;is a Principal Delivery Consultant at AWS Professional Services, helping enterprises navigate complex modernization journeys. He builds data and AI solutions for AWS customers. Outside of work he likes swimming and hiking.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/07/BDB-5291-image-4-269x300.png" alt="Headshot of Vyas" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Vyas Garigipati&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Vyas Garigipati&lt;/strong&gt;&amp;nbsp;is a Delivery Consultant at AWS Professional Services, with experience building scalable, distributed systems. He specializes in designing and building AI-powered, high-availability, multi-region architectures and helps customers deploy resilient, production-ready solutions on AWS.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/07/BDB-5291-image-5.jpeg" alt="Headshot of Vafa" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Vafa Ahmadiyeh&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Vafa Ahmadiyeh&lt;/strong&gt;&amp;nbsp;is a Principal Lead Technologist at AWS, specializing in cloud architecture for the global financial services sector. He partners with major financial institutions to modernize their infrastructure and accelerate their migration to AWS, with a focus on building secure, scalable distributed systems and platforms designed for highly regulated environments.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/07/BDB-5291-image-6.png" alt="Headshot of Kaushal" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Kaushal (KK) Agrawal&lt;/h3&gt; 
  &lt;p&gt;&lt;strong&gt;Kaushal (KK) Agrawal &lt;/strong&gt;is a Principal Technology Delivery Leader for the Digital Native Segment of AWS Professional Services, working with top-tier customers to deliver innovation at the intersection of AI and Cloud.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
	</channel>
</rss>