<?xml version="1.0" encoding="UTF-8" standalone="no"?><rss xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:slash="http://purl.org/rss/1.0/modules/slash/" xmlns:sy="http://purl.org/rss/1.0/modules/syndication/" xmlns:wfw="http://wellformedweb.org/CommentAPI/" version="2.0">

<channel>
	<title>AWS Big Data Blog</title>
	<atom:link href="https://aws.amazon.com/blogs/big-data/feed/" rel="self" type="application/rss+xml"/>
	<link>https://aws.amazon.com/blogs/big-data/</link>
	<description>Official Big Data Blog of Amazon Web Services</description>
	<lastBuildDate>Fri, 22 May 2026 15:20:11 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>
	hourly	</sy:updatePeriod>
	<sy:updateFrequency>
	1	</sy:updateFrequency>
	
	<item>
		<title>Automate deployment of data and AI applications with Amazon SageMaker Unified Studio CI/CD CLI</title>
		<link>https://aws.amazon.com/blogs/big-data/automate-deployment-of-data-and-ai-applications-with-amazon-sagemaker-unified-studio-ci-cd-cli/</link>
					
		
		<dc:creator><![CDATA[Saurabh Bhutyani]]></dc:creator>
		<pubDate>Thu, 21 May 2026 19:13:50 +0000</pubDate>
				<category><![CDATA[Amazon SageMaker]]></category>
		<category><![CDATA[Amazon SageMaker Unified Studio]]></category>
		<category><![CDATA[Announcements]]></category>
		<guid isPermaLink="false">93e112544d85fb036fb2559c0aeaa5882aec7a66</guid>

					<description>The CI/CD CLI for Amazon SageMaker Unified Studio (aws-smus-cicd-cli) is an open source command line tool that automates deployment of multi-service data and AI applications across pipeline stages. Data teams define their application once in a YAML manifest, DevOps teams deploy with a single command, and the CLI handles configuration substitution, dependency ordering, and resource provisioning automatically. In this post, we walk through how the CI/CD CLI works, show you how to deploy a real application across environments, and demonstrate how it fits into your existing CI/CD workflows.</description>
										<content:encoded>&lt;p&gt;Organizations building data and AI applications in &lt;a href="https://aws.amazon.com/sagemaker/unified-studio/" rel="noopener" target="_blank"&gt;Amazon SageMaker Unified Studio&lt;/a&gt; combine multiple AWS services, including &lt;a href="https://aws.amazon.com/glue/" rel="noopener" target="_blank"&gt;AWS Glue&lt;/a&gt;, &lt;a href="https://aws.amazon.com/athena/" rel="noopener" target="_blank"&gt;Amazon Athena&lt;/a&gt;, &lt;a href="https://aws.amazon.com/managed-workflows-for-apache-airflow/" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Workflows for Apache Airflow&lt;/a&gt; (Amazon MWAA), &lt;a href="https://aws.amazon.com/sagemaker/ai/" rel="noopener" target="_blank"&gt;Amazon SageMaker AI&lt;/a&gt;, and &lt;a href="https://aws.amazon.com/quick/quicksight/" rel="noopener" target="_blank"&gt;Amazon Quick Sight&lt;/a&gt;, into single applications. Promoting these applications from development to test and production stages requires substituting service-specific configurations for each stage and provisioning resources in the correct order.&lt;/p&gt; 
&lt;p&gt;Data teams understand which services their applications need but lack continuous integration and continuous delivery (CI/CD) expertise, while DevOps teams understand deployment automation but must learn each AWS service’s provisioning requirements.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://aws.amazon.com/about-aws/whats-new/2026/04/amazon-smus-ci-cd-cli/" target="_blank" rel="noopener noreferrer"&gt;The CI/CD CLI for Amazon SageMaker Unified Studio&lt;/a&gt; (aws-smus-cicd-cli) is an open source command line tool that automates deployment of multi-service data and AI applications across pipeline stages. Data teams define their application once in a YAML manifest, DevOps teams deploy with a single command, and the CLI handles configuration substitution, dependency ordering, and resource provisioning automatically. For details, see the &lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/cicd.html" target="_blank" rel="noopener noreferrer"&gt;CI/CD CLI documentation&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;In this post, we walk through how the CI/CD CLI works, show you how to deploy a real application across environments, and demonstrate how it fits into your existing CI/CD workflows.&lt;/p&gt; 
&lt;h2&gt;Customer spotlight&lt;/h2&gt; 
&lt;p&gt;Bureau Veritas, a global leader in testing, inspection, and certification, operates across multiple SageMaker Unified Studio environments to support its data and AI teams. With their data and DevOps teams working on different parts of the application lifecycle, Bureau Veritas needed a controlled way to promote workloads from development through test to production while preserving clear ownership boundaries between the two teams.&lt;/p&gt; 
&lt;blockquote&gt;
 &lt;p&gt;&lt;em&gt;“We need to promote data and AI applications across SageMaker Unified Studio environments in a controlled way that respects the boundaries between our data teams and our DevOps teams. The CI/CD CLI does exactly that — a single manifest from the data team, a single deploy command from DevOps, and full control over what goes to production.”&lt;/em&gt;&lt;/p&gt; 
 &lt;p&gt;— Gilles Kempf, Architecture Manager, Bureau Veritas&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;h2&gt;How the CI/CD CLI works&lt;/h2&gt; 
&lt;p&gt;The CI/CD CLI introduces a clean separation of concerns between data teams and DevOps teams.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Data teams&lt;/strong&gt; define what to deploy in a declarative YAML manifest (manifest.yaml). The manifest describes the application’s resources, including AWS Glue extract, transform, and load (ETL) jobs, Athena queries, Airflow directed acyclic graphs (DAGs), Quick Sight dashboards, and SageMaker training jobs, along with stage-specific configurations for each environment.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;DevOps teams&lt;/strong&gt; define how and when to deploy using their existing CI/CD systems. They retain full control over their deployment methodology. They choose whether to promote content through git branches, a bundle artifactory, or both; they decide the shape of the pipeline, including which stages to include (dev, staging, pre-prod, prod) and which manual approvals or security gates are required. They run &lt;code&gt;aws-smus-cicd-cli deploy&lt;/code&gt; inside GitHub Actions, Jenkins, or GitLab CI workflows without needing to understand which AWS services the application uses or how SageMaker Unified Studio projects are structured. The CLI is a utility for AWS analytics service deployment, not a CI/CD methodology. Your team’s existing conventions for branches, approvals, and pipeline shape stay exactly as they are.&lt;/p&gt; 
&lt;p&gt;The CLI is the abstraction layer between the two. It reads the manifest, substitutes stage-specific configurations (S3 paths, AWS Identity and Access Management (IAM) roles, account IDs, and connection strings), provisions resources in dependency order, and handles all AWS service interactions. The following diagram illustrates this separation:&lt;/p&gt; 
&lt;p&gt;&lt;img class="aligncenter wp-image-91122 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/08/image-1-3.png" alt="SageMaker CI/CD" width="3852" height="2002"&gt;&lt;/p&gt; 
&lt;h2&gt;Key concepts&lt;/h2&gt; 
&lt;h3&gt;Application manifest&lt;/h3&gt; 
&lt;p&gt;Each stage maps to a dedicated SageMaker Unified Studio project. This one-stage-to-one-project mapping is the foundation of CI/CD isolation: each project has its own domain, IAM boundaries, connections, and data, so changes in dev can never affect prod. For stronger isolation, projects can span different AWS accounts and AWS Regions, for example, dev in a sandbox account and prod in a production account in a different Region. Because each stage is a real SageMaker Unified Studio project, teams can open it in the console at any time to observe workflows, inspect resources, and troubleshoot deployments. Project membership is managed per project, so you control exactly who has access to each stage, for example, developers in dev and a release team in prod.&lt;/p&gt; 
&lt;p&gt;The manifest file is the single source of truth for your application. It declares:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Content&lt;/strong&gt;: application code from git repositories, data files from S3, Quick Sight dashboards, and workflow definitions.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Stages&lt;/strong&gt;: environment-specific project mappings (dev, test, prod, etc.), each isolated as described earlier.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Configuration&lt;/strong&gt;: stage-specific settings that are substituted automatically at deploy time.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Here is an example manifest for an analytics application with&amp;nbsp;AWS&amp;nbsp;Glue ETL and Quick Sight:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre class="unlimited-height-code"&gt;&lt;code class="lang-yaml"&gt;applicationName: SalesAnalyticsDashboard 
 
content: 
  storage: 
    - name: etl-code 
      include: ["*.py"] 
    - name: workflows 
      include: ["*.yaml"] 
  quicksight: 
    - name: SalesDashboard 
      type: dashboard 
  workflows: 
    - workflowName: sales_etl_pipeline 
      connectionName: default.workflow_serverless 
 
stages: 
  dev: 
    domain: 
      region: us-east-1 
    project: 
      name: analytics-dev 
    deployment_configuration: 
      storage: 
        - name: etl-code 
          connectionName: default.s3_shared 
          targetDirectory: sales/bundle/etl 
        - name: workflows 
          connectionName: default.s3_shared 
          targetDirectory: sales/bundle/workflows 
 
  prod: 
    domain: 
      region: us-west-2 
    project: 
      name: analytics-prod 
    deployment_configuration: 
      storage: 
        - name: etl-code 
          connectionName: default.s3_shared 
          targetDirectory: sales/bundle/etl 
        - name: workflows 
          connectionName: default.s3_shared 
          targetDirectory: sales/bundle/workflows 
      quicksight: 
        assets: 
          - name: SalesDashboard 
            owners: 
              - arn:aws:quicksight:${AWS_REGION}:${AWS_ACCOUNT_ID}:user/default/Admin/* 
&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Each stage must map to a separate SageMaker Unified Studio project, providing full isolation between environments. The CLI substitutes variables like &lt;code&gt;${AWS_ACCOUNT_ID}&lt;/code&gt; and &lt;code&gt;${AWS_REGION}&lt;/code&gt; at deploy time based on the target environment.&lt;/p&gt; 
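&lt;p&gt;To make the substitution step concrete, the following is a minimal Python sketch of how &lt;code&gt;${NAME}&lt;/code&gt; placeholders could be resolved per stage. It is an illustration only, not the CLI's implementation: the function name &lt;code&gt;substitute_stage_variables&lt;/code&gt; and the account ID &lt;code&gt;123456789012&lt;/code&gt; are assumptions made for the example.&lt;/p&gt; 

```python
from string import Template


def substitute_stage_variables(text, variables):
    """Replace ${NAME} placeholders in text.

    Hypothetical helper for illustration. Placeholders with no matching
    variable are left untouched rather than raising an error.
    """
    return Template(text).safe_substitute(variables)


# The owner ARN from the prod stage of the example manifest.
owner_arn = "arn:aws:quicksight:${AWS_REGION}:${AWS_ACCOUNT_ID}:user/default/Admin/*"

# Assumed values for the prod stage; the account ID is a placeholder.
prod_variables = {"AWS_REGION": "us-west-2", "AWS_ACCOUNT_ID": "123456789012"}

resolved = substitute_stage_variables(owner_arn, prod_variables)
print(resolved)
```

&lt;p&gt;Because the same template is resolved with a different set of values for each stage, the manifest itself does not need to change between dev and prod.&lt;/p&gt; 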
&lt;h3&gt;Bundles&lt;/h3&gt; 
&lt;p&gt;A &lt;em&gt;bundle&lt;/em&gt; is an immutable, versioned archive of your application. The bundle command reads from a source stage (typically dev) and packages the application code, workflow definitions, and resolved configurations into a self-contained artifact. The deploy command then applies that artifact to one or more target stages (test or prod).&lt;/p&gt; 
&lt;p&gt;This stage-to-bundle-to-stage promotion model supports controlled rollout through quality gates:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;# Package from dev 
aws-smus-cicd-cli bundle --manifest manifest.yaml 
 
# Deploy to test 
aws-smus-cicd-cli deploy --manifest app.tar.gz --targets test 
 
# Validate the test deployment 
aws-smus-cicd-cli test --manifest manifest.yaml --targets test 
 
# Promote the same bundle to prod 
aws-smus-cicd-cli deploy --manifest app.tar.gz --targets prod 
&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The same artifact is deployed at every stage without rebuilding, providing audit trails and reproducible deployments for regulated industries.&lt;/p&gt; 
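&lt;p&gt;If your pipeline needs its own evidence that the archive promoted to prod is byte-for-byte the one validated in test, one option is to record a checksum at the test stage and compare it before promotion. The following Python sketch shows the idea. It is an additional pipeline-side check, not a feature of the CLI, and the helper names and stand-in file contents are assumptions.&lt;/p&gt; 

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of_file(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_same_artifact(path, recorded_digest):
    """True if the file at path still matches the digest recorded earlier."""
    return sha256_of_file(path) == recorded_digest


# Demo with a stand-in file; a real pipeline would point at the bundle archive.
with tempfile.TemporaryDirectory() as workdir:
    bundle_path = Path(workdir) / "app.tar.gz"
    bundle_path.write_bytes(b"stand-in bundle contents")

    # Record the digest when the bundle is deployed to test.
    digest_at_test = sha256_of_file(bundle_path)
    unchanged = is_same_artifact(bundle_path, digest_at_test)

    # Simulate an accidental rebuild before promotion to prod.
    bundle_path.write_bytes(b"rebuilt bundle contents")
    changed = not is_same_artifact(bundle_path, digest_at_test)

print(unchanged, changed)
```

&lt;p&gt;Storing the recorded digest alongside the approval record gives auditors a simple way to tie a production deployment back to the artifact that passed validation.&lt;/p&gt; 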
&lt;h3&gt;SageMaker Catalog integration&lt;/h3&gt; 
&lt;p&gt;The CLI manages Amazon SageMaker Catalog resources as part of the deployment process. You can define catalog assets, glossaries, glossary terms, form types, asset types, and metadata forms, in your manifest. During deployment, the CLI searches for assets in the catalog, creates subscription requests for required data access, and waits for approval before proceeding. This automates the data governance workflow that teams previously handled manually.&lt;/p&gt; 
&lt;h2&gt;CLI commands&lt;/h2&gt; 
&lt;p&gt;The CI/CD CLI provides commands that cover the full deployment lifecycle:&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Command&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Description&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;describe&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Validates the manifest, checks that target projects exist, and confirms the execution role has required permissions. Use &lt;code&gt;--connect&lt;/code&gt; to validate against live AWS environments.&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;bundle&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Reads from a source stage and packages application code, workflow definitions, and configurations into an immutable, versioned archive.&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;deploy&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Applies bundle contents to one or more target stages. Provisions resources in dependency order.&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;test&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Runs post-deployment validation to confirm services are running and ready for workloads.&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;create&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Generates a starter manifest from an existing SageMaker Unified Studio project.&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;run&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Triggers Airflow workflow execution on MWAA or Airflow Serverless connections.&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;monitor&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Monitors workflow execution status in real time.&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;logs&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Fetches and streams workflow execution logs.&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;destroy&lt;/td&gt; 
   &lt;td style="padding: 10px;border: 1px solid #dddddd"&gt;Removes deployed resources and projects for cleanup or failure recovery.&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;h2&gt;Walkthrough: deploying a Quick Sight dashboard with AWS Glue ETL&lt;/h2&gt; 
&lt;p&gt;In this section, we walk through deploying an analytics application that uses AWS Glue for ETL, Athena for queries, and Quick Sight for dashboards. This example is available in the GitHub repository.&lt;/p&gt; 
&lt;h3&gt;Use case&lt;/h3&gt; 
&lt;p&gt;An analytics team owns a Sales Analytics Dashboard built on AWS Glue ETL, Athena, and Quick Sight. They want to promote changes from a development environment to production with reproducible builds, automated validation, and a clear approval gate between stages, without writing custom deployment scripts or exposing data engineers to AWS provisioning details.&lt;/p&gt; 
&lt;h3&gt;Solution overview&lt;/h3&gt; 
&lt;p&gt;We use a sample application from the CI/CD CLI GitHub repository that includes AWS Glue ETL scripts, an Airflow workflow definition, a Quick Sight dashboard bundle, and integration tests. A single manifest.yaml describes the application and its stages. The CLI handles the full lifecycle: bundle the app from dev, deploy it to test, run validation, and promote the same immutable artifact to prod.&lt;/p&gt; 
&lt;h3&gt;Prerequisites&lt;/h3&gt; 
&lt;p&gt;Before you begin, make sure you have the following:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Python 3.8 or later.&lt;/li&gt; 
 &lt;li&gt;AWS credentials with permissions to deploy to your SageMaker Unified Studio projects. For details on configuring credentials, see &lt;a href="https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html" target="_blank" rel="noopener noreferrer"&gt;Configuration and credential file settings in the AWS CLI&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;Existing SageMaker Unified Studio projects for your target stages.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Solution architecture&lt;/h3&gt; 
&lt;p&gt;Each stage in the manifest maps to a dedicated SageMaker Unified Studio project (see the separation-of-concerns diagram in “How the CI/CD CLI works” earlier in this post). At deploy time, the CLI uploads ETL scripts and workflow definitions to the project’s S3 storage connection, provisions the Airflow workflow in MWAA Serverless, runs the workflow to create AWS Glue jobs and databases, and imports the Quick Sight dashboard. The same bundle artifact is applied to every downstream stage, ensuring dev, test, and prod stay in sync while remaining fully isolated.&lt;/p&gt; 
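&lt;p&gt;The one-stage-to-one-project rule is easy to check before a deployment starts. The following Python sketch assumes the &lt;code&gt;stages&lt;/code&gt; section of the manifest has already been loaded into a dictionary with the same keys as the example manifest shown earlier. The function name &lt;code&gt;find_shared_projects&lt;/code&gt; is hypothetical, and the check is an illustration rather than the validation the &lt;code&gt;describe&lt;/code&gt; command performs.&lt;/p&gt; 

```python
def find_shared_projects(stages):
    """Return {(region, project): [stage names]} for any project
    that is referenced by more than one stage."""
    seen = {}
    for stage_name, stage in stages.items():
        key = (stage["domain"]["region"], stage["project"]["name"])
        seen.setdefault(key, []).append(stage_name)
    return {key: names for key, names in seen.items() if len(names) != 1}


# Mirrors the dev and prod stages of the example manifest.
stages = {
    "dev": {"domain": {"region": "us-east-1"}, "project": {"name": "analytics-dev"}},
    "prod": {"domain": {"region": "us-west-2"}, "project": {"name": "analytics-prod"}},
}

conflicts = find_shared_projects(stages)
print(conflicts or "each stage has its own project")
```

&lt;p&gt;An empty result means every stage points at a distinct project, which is the precondition for the isolation described above.&lt;/p&gt; 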
&lt;h3&gt;Solution implementation&lt;/h3&gt; 
&lt;h4&gt;Step 1: Install the CLI&lt;/h4&gt; 
&lt;p&gt;Install the CLI from PyPI:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;div class="hide-language"&gt; 
  &lt;pre&gt;&lt;code class="lang-code"&gt;pip install aws-smus-cicd-cli&lt;/code&gt;&lt;/pre&gt; 
 &lt;/div&gt; 
&lt;/div&gt; 
&lt;h4&gt;Step 2: Create or customize a manifest&lt;/h4&gt; 
&lt;p&gt;Clone the repository and start from the analytics example:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;div class="hide-language"&gt; 
  &lt;pre&gt;&lt;code class="lang-code"&gt;git clone https://github.com/aws/CICD-for-SageMakerUnifiedStudio.git
cd CICD-for-SageMakerUnifiedStudio/examples/analytic-workflow/dashboard-glue-quick&lt;/code&gt;&lt;/pre&gt; 
 &lt;/div&gt; 
&lt;/div&gt; 
&lt;p&gt;The example includes AWS Glue ETL scripts, an Airflow workflow definition, a Quick Sight dashboard bundle, and integration tests. Open manifest.yaml and update the project, domain, and deployment_configuration values under each stage so they match your own SageMaker Unified Studio projects and connection names. Alternatively, generate a manifest from an existing project: &lt;code&gt;aws-smus-cicd-cli create --domain-id &amp;lt;your-domain-id&amp;gt; --dev-project-id &amp;lt;your-project-id&amp;gt;&lt;/code&gt;&lt;/p&gt; 
&lt;h4&gt;Step 3: Validate your configuration&lt;/h4&gt; 
&lt;p&gt;Run the describe command with &lt;code&gt;--connect&lt;/code&gt; to verify your environment is ready. This connects to your AWS environment and validates that target projects exist, the execution role has the required permissions, and connections are reachable. Fix any issues before deploying.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;div class="hide-language"&gt; 
  &lt;pre&gt;&lt;code class="lang-code"&gt;aws-smus-cicd-cli describe --manifest manifest.yaml --connect&lt;/code&gt;&lt;/pre&gt; 
 &lt;/div&gt; 
&lt;/div&gt; 
&lt;h4&gt;Step 4: Deploy&lt;/h4&gt; 
&lt;p&gt;Run the deployment:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;aws-smus-cicd-cli deploy --targets test --manifest manifest.yaml&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;During deployment, the CLI:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Uploads ETL scripts and workflow definitions to S3 using the project’s storage connection.&lt;/li&gt; 
 &lt;li&gt;Creates the Airflow workflow in MWAA Serverless.&lt;/li&gt; 
 &lt;li&gt;Runs the workflow, which provisions AWS Glue jobs, creates databases, and runs ETL transformations.&lt;/li&gt; 
 &lt;li&gt;Imports the Quick Sight dashboard and refreshes datasets with the latest data.&lt;/li&gt; 
 &lt;li&gt;Processes any catalog asset subscriptions defined in the manifest.&lt;/li&gt; 
&lt;/ol&gt; 
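&lt;p&gt;Provisioning in dependency order can be pictured as a topological sort over the deployment steps. The following Python sketch models the steps listed above as a dependency graph. The step names are hypothetical, the assumption that each step depends only on the one before it is ours, and the CLI's internal ordering logic may differ. The &lt;code&gt;graphlib&lt;/code&gt; module requires Python 3.9 or later.&lt;/p&gt; 

```python
from graphlib import TopologicalSorter

# Hypothetical step names; each maps to the set of steps it depends on.
dependencies = {
    "upload_to_s3": set(),
    "create_workflow": {"upload_to_s3"},
    "run_workflow": {"create_workflow"},
    "import_dashboard": {"run_workflow"},
    "process_catalog_subscriptions": {"import_dashboard"},
}

# static_order() yields every step after all of its dependencies.
deployment_order = list(TopologicalSorter(dependencies).static_order())
print(deployment_order)
```

&lt;p&gt;Expressing the order as a graph rather than a fixed list means a new resource type only needs to declare what it depends on; a cycle in the declarations raises &lt;code&gt;graphlib.CycleError&lt;/code&gt; instead of producing an invalid order.&lt;/p&gt; 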
&lt;h4&gt;Step 5: Validate&lt;/h4&gt; 
&lt;p&gt;Run post-deployment validation to confirm services are running and ready for workloads:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;aws-smus-cicd-cli test --manifest manifest.yaml --targets test&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h4&gt;Step 6: Promote to production&lt;/h4&gt; 
&lt;p&gt;Promote the same bundle artifact that was validated in the test stage to production. This guarantees the exact same artifact runs in prod:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Promote the same bundle that was validated in test to prod

aws-smus-cicd-cli deploy --manifest app.tar.gz --targets prod&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h2&gt;Integrating with GitHub Actions&lt;/h2&gt; 
&lt;p&gt;The CLI works with existing CI/CD solutions. The GitHub repository includes reusable workflow templates that DevOps teams can adopt directly. The following is an example of a GitHub Actions workflow that implements a full bundle-based deployment pipeline:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre class="unlimited-height-code"&gt;&lt;code class="lang-yaml"&gt;name: Deploy Analytics Application 
on: 
  push: 
    branches: [main] 
 
jobs: 
  deploy-test: 
    runs-on: ubuntu-latest 
    steps: 
      - uses: actions/checkout@v4 
 
      - name: Install CLI 
        run: pip install aws-smus-cicd-cli 
 
      - name: Configure AWS credentials 
        uses: aws-actions/configure-aws-credentials@v4 
        with: 
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }} 
          aws-region: us-east-1 
 
      - name: Validate 
        run: aws-smus-cicd-cli describe --manifest manifest.yaml --connect 
 
      - name: Bundle 
        run: aws-smus-cicd-cli bundle --manifest manifest.yaml 
 
      - name: Deploy to test 
        run: aws-smus-cicd-cli deploy --targets test --manifest manifest.yaml 
 
      - name: Run tests 
        run: aws-smus-cicd-cli test --manifest manifest.yaml --targets test 
 
  deploy-prod: 
    needs: deploy-test 
    runs-on: ubuntu-latest 
    environment: production 
    steps: 
      - uses: actions/checkout@v4 
 
      - name: Install CLI 
        run: pip install aws-smus-cicd-cli 
 
      - name: Configure AWS credentials 
        uses: aws-actions/configure-aws-credentials@v4 
        with: 
          role-to-assume: ${{ secrets.AWS_PROD_ROLE_ARN }} 
          aws-region: us-west-2 
 
      - name: Deploy to production 
        run: aws-smus-cicd-cli deploy --targets prod --manifest manifest.yaml
&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The CLI also works with &lt;a href="https://www.jenkins.io/" target="_blank" rel="noopener noreferrer"&gt;Jenkins&lt;/a&gt;, &lt;a href="https://docs.gitlab.com/ee/ci/" target="_blank" rel="noopener noreferrer"&gt;GitLab CI&lt;/a&gt;, and &lt;a href="https://azure.microsoft.com/en-us/products/devops/" target="_blank" rel="noopener noreferrer"&gt;Azure DevOps&lt;/a&gt;. See the &lt;a href="https://github.com/aws/CICD-for-SageMakerUnifiedStudio" target="_blank" rel="noopener noreferrer"&gt;CI/CD integration guide&lt;/a&gt; for additional examples.&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;In the next section, we cover which AWS services and workload types the CLI supports.&lt;/em&gt;&lt;/p&gt; 
&lt;h2&gt;Supported workloads&lt;/h2&gt; 
&lt;p&gt;The CLI deploys applications that span the following AWS services through Airflow workflow definitions:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Analytics and BI:&lt;/strong&gt; AWS Glue ETL jobs and crawlers, Amazon Athena queries, Amazon Quick Sight dashboards, Amazon EMR jobs, Amazon Redshift queries.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Machine learning:&lt;/strong&gt; SageMaker training jobs, ML model endpoints, SageMaker AI Pipelines.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Code and workflows:&lt;/strong&gt; Jupyter notebooks, Python scripts, Airflow DAGs (MWAA and MWAA Serverless).&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Data and storage:&lt;/strong&gt; S3 data files, Git repositories, SageMaker Catalog resources (glossaries, glossary terms, form types, asset types, assets, data products, metadata forms).&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The examples directory includes working applications for each of these patterns, with manifests, workflow definitions, and integration tests.&lt;/p&gt; 
&lt;h2&gt;Failure recovery&lt;/h2&gt; 
&lt;p&gt;If a deployment fails, the CLI stops at the point of failure and reports the error with a detailed stack trace. To recover:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Run &lt;code&gt;aws-smus-cicd-cli describe --connect&lt;/code&gt; to check which resources exist and which permissions are missing.&lt;/li&gt; 
 &lt;li&gt;Fix the issue and rerun &lt;code&gt;aws-smus-cicd-cli deploy&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;For bundle-based deployments, redeploy a previous bundle version.&lt;/li&gt; 
 &lt;li&gt;Use &lt;code&gt;aws-smus-cicd-cli destroy --targets &amp;lt;target&amp;gt; --force&lt;/code&gt; to clean up a failed deployment.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;For detailed rollback procedures, see the &lt;a href="https://github.com/aws/CICD-for-SageMakerUnifiedStudio/blob/main/docs/rollback-guide.md" target="_blank" rel="noopener noreferrer"&gt;Rollback Guide&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, you learned how the Amazon SageMaker Unified Studio CI/CD CLI gives data and DevOps teams a clean separation of concerns: data teams describe their application once in a YAML manifest, and DevOps teams deploy it with a single command through their existing CI/CD pipelines. You saw how stages map to isolated SageMaker Unified Studio projects (optionally spanning AWS accounts and Regions), how bundles provide immutable, reproducible promotion through test and production, and how the CLI integrates with GitHub Actions, Jenkins, GitLab CI, and Azure DevOps. You also walked through deploying a Glue-and-Quick-Sight analytics application from dev through to prod.&lt;/p&gt; 
&lt;h2&gt;Get started&lt;/h2&gt; 
&lt;p&gt;The CI/CD CLI is available at no additional cost in all AWS Regions where Amazon SageMaker Unified Studio is available. You pay only for the underlying AWS resources provisioned during deployment.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Use the following steps to try it out:&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Install the CLI: 
  &lt;div class="hide-language"&gt;
   &lt;code class="lang-bash"&gt;pip install aws-smus-cicd-cli&lt;/code&gt;
  &lt;/div&gt; &lt;/li&gt; 
 &lt;li&gt;Browse the example applications for analytics and ML patterns.&lt;/li&gt; 
 &lt;li&gt;Follow the &lt;a href="https://docs.aws.amazon.com/sagemaker-unified-studio/latest/userguide/cicd.html" target="_blank" rel="noopener noreferrer"&gt;CI/CD CLI documentation&lt;/a&gt; to deploy your first application in 10 minutes.&lt;/li&gt; 
 &lt;li&gt;Review the Admin Guide for infrastructure setup.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;For feedback and bug reports, open an issue on the &lt;a href="https://github.com/aws/CICD-for-SageMakerUnifiedStudio" target="_blank" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2024/06/26/rameshsi.png" alt="Ramesh H Singh" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Ramesh H Singh&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="http://www.linkedin.com/in/ramesh-harisaran-singh" target="_blank" rel="noopener"&gt;Ramesh H Singh&lt;/a&gt; is a Senior Product Manager Technical (External Services) at AWS in Seattle, Washington, currently with the Amazon SageMaker team. He is passionate about building high-performance ML/AI and analytics products that help enterprise customers achieve their critical goals using cutting-edge technology.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/04/23/vasu.jpeg" alt="Vasudevan Venkataramanan" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Vasudevan Venkataramanan&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/vasudevan-venkataramanan-1271b611/" target="_blank" rel="noopener"&gt;Vasudevan Venkataramanan&lt;/a&gt; is a Senior Software Engineer on the Amazon SageMaker Unified Studio team. He is responsible for technical direction of scheduling and orchestration within SageMaker Unified Studio. Outside of his professional work, he enjoys spending time with his kid, and playing pickleball and cricket.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2020/09/21/amir-bar-100.jpg" alt="Amir Bar Or" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Amir Bar Or&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/amir-bar-or-abb911/" target="_blank" rel="noopener"&gt;Amir&lt;/a&gt; is a Principal Engineer at AWS specializing in analytics, distributed systems, identity, and database internals. He founded Amazon DataZone and SageMaker Unified Studio, and works across AWS analytics services — driving innovation, tackling complex technical challenges, and raising the bar for engineering excellence.&lt;/p&gt;
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/20/bdb5967a1.png" alt="Nikita Arbuzov" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Nikita Arbuzov&lt;/h3&gt; 
  &lt;p&gt;Nikita Arbuzov is a Software Engineer on the Amazon SageMaker Unified Studio team. He is responsible for building support for CI/CD features within SageMaker Unified Studio.&lt;/p&gt;
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2020/11/03/Saurabh-Bhutyani.jpg" alt="Saurabh Bhutyani" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Saurabh Bhutyani&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="http://www.linkedin.com/in/s4saurabh" target="_blank" rel="noopener"&gt;Saurabh Bhutyani&lt;/a&gt; is a Principal Analytics Specialist Solutions Architect at AWS. He is passionate about new technologies. He joined AWS in 2019 and works with customers to provide architectural guidance for running generative AI use cases, scalable analytics solutions and data mesh architectures using AWS services like Amazon Bedrock, Amazon SageMaker Unified Studio, Amazon EMR, Amazon Athena, AWS Glue, AWS Lake Formation, and Amazon DataZone.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>A systematic approach to benchmarking SQL processing engines on AWS</title>
		<link>https://aws.amazon.com/blogs/big-data/a-systematic-approach-to-benchmarking-sql-processing-engines-on-aws/</link>
					
		
		<dc:creator><![CDATA[Anubhav Awasthi]]></dc:creator>
		<pubDate>Tue, 19 May 2026 15:44:02 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon Athena]]></category>
		<category><![CDATA[Amazon EMR]]></category>
		<category><![CDATA[Amazon Redshift]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">bf0d9f4fbdbfaaeb6a0f5b4045e7f3de16c5b8ea</guid>

					<description>Selecting the right SQL processing solution for large-scale data analytics is a critical decision for organizations. As data volumes grow exponentially, the technology landscape has evolved to offer diverse options for processing and analyzing this information efficiently. This post presents a systematic framework for evaluating and benchmarking SQL processing engines on AWS, using Apache JMeter to conduct practical performance testing at scale.</description>
										<content:encoded>&lt;p&gt;Selecting the right SQL processing solution for large-scale data analytics is a critical decision for organizations. As data volumes grow exponentially, the technology landscape has evolved to offer diverse options for processing and analyzing this information efficiently. This post presents a systematic framework for evaluating and benchmarking SQL processing engines on AWS, using &lt;a href="https://jmeter.apache.org/" target="_blank" rel="noopener noreferrer"&gt;Apache JMeter&lt;/a&gt; to conduct practical performance testing at scale.&lt;/p&gt; 
&lt;h2&gt;The AWS analytics ecosystem&lt;/h2&gt; 
&lt;p&gt;AWS offers a rich portfolio of SQL processing solutions to meet various analytical needs:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Serverless query services&lt;/strong&gt; – &lt;a href="https://aws.amazon.com/athena/" target="_blank" rel="noopener noreferrer"&gt;Amazon Athena&lt;/a&gt; is a serverless, interactive query service that uses standard SQL to analyze data in &lt;a href="https://aws.amazon.com/s3" target="_blank" rel="noopener noreferrer"&gt;Amazon Simple Storage Service&lt;/a&gt; (Amazon S3), offering automatic scaling, parallel query execution, and pay-per-query pricing with no infrastructure management required&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Data warehouse solutions&lt;/strong&gt; – &lt;a href="https://aws.amazon.com/redshift/" target="_blank" rel="noopener noreferrer"&gt;Amazon Redshift&lt;/a&gt; offers scalable, high-performance cloud data warehousing with serverless options, zero-ETL integrations, AI-powered query assistance, and seamless machine learning (ML) integration for modern analytics at scale&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Managed open source engines&lt;/strong&gt; – &lt;a href="https://aws.amazon.com/emr/" target="_blank" rel="noopener noreferrer"&gt;Amazon EMR&lt;/a&gt; supports Apache Spark SQL, Trino (formerly PrestoSQL), and other distributed query frameworks&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Self-managed options&lt;/strong&gt; – You can deploy open source engines like Apache Spark, Apache Flink, and Trino on &lt;a href="https://aws.amazon.com/eks/" target="_blank" rel="noopener noreferrer"&gt;Amazon Elastic Kubernetes Service&lt;/a&gt; (Amazon EKS) for greater control&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Partner solutions&lt;/strong&gt; – You can access specialized big data analytics tools through &lt;a href="https://aws.amazon.com/marketplace" target="_blank" rel="noopener noreferrer"&gt;AWS Marketplace&lt;/a&gt;&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;These options are further enhanced by modern open table formats such as &lt;a href="https://iceberg.apache.org/" target="_blank" rel="noopener noreferrer"&gt;Apache Iceberg&lt;/a&gt;, &lt;a href="https://delta.io/" target="_blank" rel="noopener noreferrer"&gt;Delta Lake&lt;/a&gt;, and &lt;a href="https://hudi.apache.org/" target="_blank" rel="noopener noreferrer"&gt;Apache Hudi&lt;/a&gt;, which bring crucial enterprise features like ACID (Atomicity, Consistency, Isolation, and Durability) transactions, schema evolution, and time travel capabilities to data lakes. These SQL processing solutions operate under the AWS Shared Responsibility Model. AWS manages the security of the underlying infrastructure and services, and customers are responsible for secure configuration, access management, and data protection within their testing environments. This division of responsibility remains important when evaluating and benchmarking different SQL engines. Proper security configuration and implementation by customers is essential for maintaining a secure analytics environment.&lt;/p&gt; 
&lt;h2&gt;Evaluation challenges in SQL engine selection&lt;/h2&gt; 
&lt;p&gt;The rich ecosystem of SQL processing options creates significant evaluation challenges. Each SQL engine employs unique architectural approaches and optimization strategies, making direct comparisons complex. Organizations embarking on this evaluation journey face several interconnected obstacles:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Creating environments that accurately reflect production scenarios&lt;/li&gt; 
 &lt;li&gt;Developing test datasets that mirror real-world data characteristics and volumes&lt;/li&gt; 
 &lt;li&gt;Replicating real-world query patterns and concurrency levels&lt;/li&gt; 
 &lt;li&gt;Maintaining uniform testing conditions across different engine architectures&lt;/li&gt; 
 &lt;li&gt;Controlling infrastructure expenses throughout the evaluation process&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Performance considerations at petabyte scale&lt;/h2&gt; 
&lt;p&gt;When evaluating solutions for petabyte-scale deployments, the complexity intensifies considerably. Several critical factors come into play:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Resource management&lt;/strong&gt; – Distributed SQL engines require precise balancing of CPU, memory, and storage resources. Suboptimal resource allocation can lead to query failures and performance degradation, particularly as data volumes grow.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Data distribution patterns&lt;/strong&gt; – How data is distributed across partitions or nodes significantly impacts query performance. Data skew can create processing bottlenecks, with some nodes handling disproportionate workloads while others remain underutilized.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Concurrency handling&lt;/strong&gt; – High-concurrency environments demand sophisticated workload scheduling and resource isolation mechanisms. The ability to maintain consistent performance under varying concurrent loads becomes a critical differentiator between solutions.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Meaningful metrics&lt;/strong&gt; – Performance evaluation at scale requires comprehensive metrics analysis: 
  &lt;ul&gt; 
   &lt;li&gt;Mean, median, and percentile response times (particularly p90 and p95)&lt;/li&gt; 
   &lt;li&gt;Query throughput under varying concurrency levels&lt;/li&gt; 
   &lt;li&gt;Scalability characteristics across diverse workload types&lt;/li&gt; 
   &lt;li&gt;Resource utilization efficiency during peak loads&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ul&gt; 
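&lt;p&gt;As a minimal sketch of the response-time metrics listed above, the following Python snippet computes the mean, median, p90, and p95 for one set of measurements. The sample values and the nearest-rank percentile definition are assumptions for illustration. JMeter and other tools may use a different percentile method and can report slightly different values.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids floating-point rounding surprises
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(response_times_ms):
    """Return the headline latency metrics for one query at one concurrency level."""
    return {
        "mean": statistics.mean(response_times_ms),
        "median": statistics.median(response_times_ms),
        "p90": percentile(response_times_ms, 90),
        "p95": percentile(response_times_ms, 95),
    }


# Hypothetical response times in milliseconds
sample_ms = [120, 135, 150, 160, 180, 210, 250, 400, 900, 1500]
print(summarize(sample_ms))&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;In this made-up sample, the mean (400.5 ms) is roughly double the median (195 ms) because of two slow outliers, which is why percentiles are worth tracking alongside averages.&lt;/p&gt; 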
&lt;h2&gt;Limitations of traditional benchmarks&lt;/h2&gt; 
&lt;p&gt;Although industry-standard benchmarks like TPC-DS and TPC-H provide valuable insights, our experience with multiple customer engagements has shown that tailored, workload-specific testing often reveals performance characteristics not captured by these standardized tests. This is especially true for complex, multi-tenant environments with diverse query patterns. Organizations that complement standard benchmarks with workload-specific testing typically experience shorter proof-of-concept cycles, optimized evaluation costs, and more efficient testing operations. This comprehensive approach helps reduce uncertainty in the final solution selection process.&lt;/p&gt; 
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;Before you dive into the evaluation process, make sure you have the following prerequisites:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;An AWS account with appropriate permissions to create and manage &lt;a href="https://aws.amazon.com/ec2" target="_blank" rel="noopener noreferrer"&gt;Amazon Elastic Compute Cloud&lt;/a&gt; (Amazon EC2) instances and access the SQL engines you plan to benchmark.&lt;/li&gt; 
 &lt;li&gt;Basic familiarity with AWS services, particularly Amazon EC2 and the SQL engines you intend to evaluate (such as Athena, Amazon Redshift, or Amazon EMR).&lt;/li&gt; 
 &lt;li&gt;Experience with SQL and data analytics concepts.&lt;/li&gt; 
 &lt;li&gt;Access to the SQL engines you choose to benchmark. This post assumes you’ve already set up the engines you want to test. For setup instructions, refer to the AWS documentation for each service.&lt;/li&gt; 
 &lt;li&gt;A dataset suitable for your benchmarking needs. Dataset creation and loading are not covered in this post. &lt;a href="https://aws.amazon.com/blogs/big-data/build-petabyte-scale-synthetic-test-data-with-amazon-emr-on-ec2/" target="_blank" rel="noopener noreferrer"&gt;Build petabyte-scale synthetic test data with Amazon EMR on EC2&lt;/a&gt; provides prescriptive guidance to generate test datasets at scale. Make sure your test datasets are stored in S3 buckets with encryption enabled (using SSE-KMS or SSE-S3) and that all service connections use TLS for data in transit.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Benefits of Apache JMeter&lt;/h2&gt; 
&lt;p&gt;As organizations scale their analytics workloads to petabyte levels, there is a growing need for a robust, structured approach to SQL query performance testing. Although many organizations develop custom testing frameworks or use various benchmarking tools, these approaches often lack standardization and can be difficult to replicate across different SQL engines. The complexity of modern data architectures, combined with the variety of available SQL processing solutions, demands a systematic evaluation methodology. Apache JMeter emerges as a powerful solution to address this challenge. Though traditionally known for web application testing, JMeter’s extensible architecture and robust feature set make it particularly well-suited for SQL performance testing at scale.&lt;/p&gt; 
&lt;p&gt;JMeter offers several advantages for evaluating SQL engines:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Support for multiple protocols and connections&lt;/li&gt; 
 &lt;li&gt;Ability to simulate complex concurrent workloads&lt;/li&gt; 
 &lt;li&gt;Built-in performance metrics and reporting&lt;/li&gt; 
 &lt;li&gt;Extensible architecture for custom testing scenarios&lt;/li&gt; 
 &lt;li&gt;Integration capabilities with continuous integration and continuous delivery (CI/CD) pipelines&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Through this proposed framework, which has been validated across multiple customer engagements at petabyte scale, we aim to help organizations make more informed decisions when selecting a SQL processing solution. Our experience working with customers to assess various AWS Analytics services and open source solutions has demonstrated that a systematic evaluation approach significantly reduces proof-of-concept cycles and optimizes resource investments. This framework has helped organizations effectively evaluate services like Athena, Amazon Redshift, and Amazon EMR, alongside open source solutions such as Trino on Amazon EKS, based on their specific workload profiles and performance requirements.&lt;/p&gt; 
&lt;p&gt;With this methodology, organizations can accomplish the following:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Navigate the complex landscape of large-scale data processing technologies&lt;/li&gt; 
 &lt;li&gt;Reduce proof-of-concept cycles from months to weeks&lt;/li&gt; 
 &lt;li&gt;Minimize infrastructure costs during evaluation phases&lt;/li&gt; 
 &lt;li&gt;Make data-driven decisions about technology selection&lt;/li&gt; 
 &lt;li&gt;Better align technology choices with business requirements&lt;/li&gt; 
 &lt;li&gt;Establish repeatable testing patterns for future evaluations&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Testing methodology in practice&lt;/h2&gt; 
&lt;p&gt;A successful SQL engine evaluation requires understanding and replicating real-world workload patterns. Our methodology, refined through numerous customer engagements, focuses on comprehensive testing across multiple dimensions while remaining adaptable to specific organizational needs.&lt;/p&gt; 
&lt;h3&gt;Query pattern selection&lt;/h3&gt; 
&lt;p&gt;We begin by selecting representative query patterns that mirror production workloads:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Aggregation queries that summarize large datasets using operations like SUM, AVG, and COUNT&lt;/li&gt; 
 &lt;li&gt;Complex join operations that test the engine’s ability to combine data efficiently across multiple tables&lt;/li&gt; 
 &lt;li&gt;String operations that evaluate text processing capabilities&lt;/li&gt; 
 &lt;li&gt;Nested queries that assess the engine’s optimization capabilities for complex query structures&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;A carefully selected set of 8–10 queries typically provides sufficient coverage while keeping the evaluation manageable. These should reflect your actual workload characteristics and business requirements.&lt;/p&gt; 
&lt;h3&gt;Data volume variations&lt;/h3&gt; 
&lt;p&gt;Testing across different data volumes is important for understanding scalability characteristics. We structure our tests around varying data scan ranges:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Small-scale scans&lt;/strong&gt; – Queries accessing 1–7 days of data (megabytes to gigabytes)&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Large-scale scans&lt;/strong&gt; – Queries spanning 14–30 days (terabytes to petabytes)&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;This approach evaluates both I/O efficiency with large datasets and metadata handling with smaller, frequent queries, helping understand how services like Amazon EMR, Amazon Redshift, or Athena optimize query execution across different access patterns.&lt;/p&gt; 
&lt;h3&gt;Concurrency testing&lt;/h3&gt; 
&lt;p&gt;Real-world analytics environments rarely process single queries in isolation. Our methodology incorporates the following features:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Progressive concurrency testing that steps up through increasing levels (typically 16, 32, 64, and 128 parallel queries). You can adjust these numbers based on your test infrastructure capacity and specific requirements. We recommend starting with smaller concurrency levels and gradually scaling up to understand performance characteristics.&lt;/li&gt; 
 &lt;li&gt;Varied query complexity and frequency (referred to as &lt;em&gt;query weights&lt;/em&gt;) to simulate realistic workload distributions. This means some queries are run more often or are more resource-intensive than others, mimicking real-world usage patterns.&lt;/li&gt; 
 &lt;li&gt;Mixed query patterns running simultaneously to test resource management.&lt;/li&gt; 
 &lt;li&gt;Consistent execution across different date ranges to evaluate scaling behavior.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;This approach is particularly important when evaluating features of managed services, such as the workload management capabilities of Amazon Redshift or the resource allocation strategies of Amazon EMR.&lt;/p&gt; 
&lt;h3&gt;Query weight distribution&lt;/h3&gt; 
&lt;p&gt;Production environments typically see varying frequencies of different query types. Our framework incorporates weighted query distribution to simulate real-world scenarios more accurately. In a typical distribution, frequent lightweight queries might represent 60% of the workload, complex analytical queries might comprise 30%, and resource-intensive data processing operations might make up the remaining 10%. This weighted approach makes sure performance testing reflects actual usage patterns rather than artificial benchmarking scenarios. The exact distribution should mirror your organization’s specific workload patterns.&lt;/p&gt; 
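&lt;p&gt;To make the idea of a weighted distribution concrete, the following Python sketch draws query categories at random in proportion to the illustrative 60/30/10 split. The category names, the number of submissions, and the fixed seed are assumptions for this example and are not part of any JMeter configuration.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;import random
from collections import Counter

# Illustrative workload mix: 60% lightweight, 30% analytical, 10% heavy processing
QUERY_WEIGHTS = {"lightweight": 60, "analytical": 30, "heavy_processing": 10}


def sample_workload(weights, submissions, seed=42):
    """Draw query categories at random, in proportion to their weights."""
    rng = random.Random(seed)  # fixed seed so a test run can be repeated
    categories = list(weights)
    return rng.choices(categories, weights=[weights[c] for c in categories], k=submissions)


plan = sample_workload(QUERY_WEIGHTS, submissions=1000)
print(Counter(plan))&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Because the draw is random, the observed counts only approximate the target percentages, and the approximation improves as the number of submissions grows.&lt;/p&gt; 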
&lt;h3&gt;Sequential vs. concurrent testing&lt;/h3&gt; 
&lt;p&gt;Our methodology implements two distinct testing phases:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Sequential testing&lt;/strong&gt; – Establishes baseline performance metrics: 
  &lt;ul&gt; 
   &lt;li&gt;Runs each query type independently across different date ranges&lt;/li&gt; 
   &lt;li&gt;Runs multiple iterations to provide consistency and identify variability&lt;/li&gt; 
   &lt;li&gt;Helps understand individual query performance characteristics&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Concurrent testing&lt;/strong&gt; – Simulates real-world multi-user scenarios: 
  &lt;ul&gt; 
   &lt;li&gt;Implements weighted query distributions&lt;/li&gt; 
   &lt;li&gt;Tests different concurrency levels to identify scaling limitations&lt;/li&gt; 
   &lt;li&gt;Evaluates resource management capabilities of different engines&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;JMeter efficiently implements both testing phases while maintaining consistent test conditions across SQL engines. Its ability to handle various JDBC connections makes it particularly suitable for testing AWS analytics services.&lt;/p&gt; 
&lt;p&gt;Through this structured approach, organizations can gather comprehensive performance data reflecting their specific use cases, enabling informed SQL engine selection decisions while maintaining core principles of systematic evaluation and realistic workload simulation.&lt;/p&gt; 
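&lt;p&gt;The difference between the two phases can be pictured with a small stand-alone Python sketch. It is not a replacement for JMeter: the &lt;code&gt;run_query&lt;/code&gt; function is a hypothetical placeholder that only sleeps, and the concurrency levels are shortened to keep the example quick.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;import time
from concurrent.futures import ThreadPoolExecutor


def run_query(query_id):
    """Placeholder for a real query call; sleeps briefly and returns elapsed seconds."""
    start = time.perf_counter()
    time.sleep(0.01)  # stand-in for engine latency
    return time.perf_counter() - start


def sequential_baseline(query_ids, iterations=3):
    """Run each query alone, several times, and average the timings."""
    return {
        q: sum(run_query(q) for _ in range(iterations)) / iterations
        for q in query_ids
    }


def concurrent_run(query_ids, concurrency):
    """Submit `concurrency` queries at once, cycling through the query list."""
    submitted = [query_ids[i % len(query_ids)] for i in range(concurrency)]
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        timings = list(pool.map(run_query, submitted))
    return sum(timings) / len(timings)


queries = [f"Query {n}" for n in range(1, 10)]
baseline = sequential_baseline(queries)
for level in (8, 16, 32):
    print(level, round(concurrent_run(queries, level), 4))&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The sequential phase yields one baseline figure per query, and the concurrent phase yields one figure per concurrency level, which is the shape of data the test plan templates in the next section are designed to hold.&lt;/p&gt; 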
&lt;h2&gt;Test plans&lt;/h2&gt; 
&lt;p&gt;To evaluate SQL engines’ performance under varying workloads, we designed two test scenarios: sequential and concurrent execution plans. Each scenario was executed across different data volumes by adjusting the query date range filters to cover 1, 7, 14, and 30 days. These variations simulate typical analytical workloads with progressively increasing data sizes.&lt;/p&gt; 
&lt;p&gt;For sequential runs, each test was treated as a distinct batch that groups all queries (Query 1 to Query 9) under the same date range. Within a batch, every query scans 1, 7, 14, or 30 days of data, controlled by a date filter in the query’s &lt;code&gt;WHERE&lt;/code&gt; clause. We used JMeter to capture average query response times for each batch. Each configuration was run three times, and the final metrics reflect the average response time across these iterations to improve reliability and account for environmental variance.&lt;/p&gt; 
&lt;p&gt;Although three iterations provide initial insights, if you observe significant variations in results (typically more than 10% deviation between runs), consider expanding to 10 or more iterations. This additional sampling helps establish statistical significance, identify true performance patterns, and distinguish outliers (beyond three standard deviations) from normal variations. Document any consistent anomalies, because they may indicate important performance or security considerations for your specific environment.&lt;/p&gt; 
&lt;p&gt;The following table shows a sample template for the sequential test plan.&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td rowspan="2"&gt;&lt;strong&gt;Dataset Time Range&lt;/strong&gt;&lt;/td&gt; 
   &lt;td rowspan="2"&gt;&lt;strong&gt;Run&lt;/strong&gt;&lt;/td&gt; 
   &lt;td colspan="9"&gt;&lt;strong&gt;Query Response Time&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Query 1&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Query 2&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Query 3&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Query 4&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Query 5&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Query 6&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Query 7&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Query 8&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Query 9&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td rowspan="4"&gt;1 day&lt;/td&gt; 
   &lt;td&gt;Run 1&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Run 2&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Run 3&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Avg&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td rowspan="4"&gt;7 days&lt;/td&gt; 
   &lt;td&gt;Run 1&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Run 2&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Run 3&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Avg&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td rowspan="4"&gt;14 days&lt;/td&gt; 
   &lt;td&gt;Run 1&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Run 2&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Run 3&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Avg&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td rowspan="4"&gt;30 days&lt;/td&gt; 
   &lt;td&gt;Run 1&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Run 2&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Run 3&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Avg&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
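&lt;p&gt;The averaging step and the 10% deviation guideline for sequential runs can be sketched in Python as follows. The response times are hypothetical, and interpreting deviation as the largest relative distance of any run from the average is an assumption; you might prefer another measure, such as the coefficient of variation.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;import statistics


def summarize_runs(run_times_s, threshold=0.10):
    """Average repeated runs and flag a spread larger than the threshold."""
    average = statistics.mean(run_times_s)
    # Largest relative distance of any single run from the average
    deviation = max(abs(t - average) for t in run_times_s) / average
    return {
        "avg": average,
        "deviation": deviation,
        "needs_more_runs": deviation > threshold,
    }


# Hypothetical response times (seconds) for one query over Run 1 to Run 3
stable = summarize_runs([12.0, 12.5, 11.5])
noisy = summarize_runs([12.0, 18.0, 12.0])
print(stable)
print(noisy)&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;In this example, the first set of runs stays within the threshold, and the second set is flagged as a candidate for additional iterations.&lt;/p&gt; 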
&lt;p&gt;For the concurrent test plan, we introduced a probabilistic weighted distribution to the queries (Query 1 to Query 9), simulating a more realistic production-like environment where query frequency varies based on business relevance and usage patterns. This added a layer of complexity to better reflect how the SQL engine would perform under real-world concurrent access patterns.&lt;/p&gt; 
&lt;p&gt;The following table shows a sample template for the concurrent test plan.&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td rowspan="2"&gt;&lt;strong&gt;Dataset Time Range&lt;/strong&gt;&lt;/td&gt; 
   &lt;td rowspan="2"&gt;&lt;strong&gt;Concurrent Runs&lt;/strong&gt;&lt;/td&gt; 
   &lt;td colspan="9"&gt;&lt;strong&gt;Query Weights&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Query 1&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Query 2&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Query 3&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Query 4&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Query 5&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Query 6&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Query 7&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Query 8&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Query 9&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td rowspan="5"&gt;1 day&lt;/td&gt; 
   &lt;td&gt;8&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;16&lt;/td&gt; 
   &lt;td&gt;10%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;24%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;24%&lt;/td&gt; 
   &lt;td&gt;14%&lt;/td&gt; 
   &lt;td&gt;10%&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;32&lt;/td&gt; 
   &lt;td&gt;8%&lt;/td&gt; 
   &lt;td&gt;3%&lt;/td&gt; 
   &lt;td&gt;24%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;24%&lt;/td&gt; 
   &lt;td&gt;16%&lt;/td&gt; 
   &lt;td&gt;8%&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;64&lt;/td&gt; 
   &lt;td&gt;7%&lt;/td&gt; 
   &lt;td&gt;3%&lt;/td&gt; 
   &lt;td&gt;24%&lt;/td&gt; 
   &lt;td&gt;6%&lt;/td&gt; 
   &lt;td&gt;4%&lt;/td&gt; 
   &lt;td&gt;6%&lt;/td&gt; 
   &lt;td&gt;26%&lt;/td&gt; 
   &lt;td&gt;16%&lt;/td&gt; 
   &lt;td&gt;9%&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;128&lt;/td&gt; 
   &lt;td&gt;1%&lt;/td&gt; 
   &lt;td&gt;4%&lt;/td&gt; 
   &lt;td&gt;19%&lt;/td&gt; 
   &lt;td&gt;8%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;7%&lt;/td&gt; 
   &lt;td&gt;14%&lt;/td&gt; 
   &lt;td&gt;20%&lt;/td&gt; 
   &lt;td&gt;22%&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td rowspan="5"&gt;*7 days&lt;/td&gt; 
   &lt;td&gt;8&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;16&lt;/td&gt; 
   &lt;td&gt;10%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;24%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;24%&lt;/td&gt; 
   &lt;td&gt;14%&lt;/td&gt; 
   &lt;td&gt;10%&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;32&lt;/td&gt; 
   &lt;td&gt;8%&lt;/td&gt; 
   &lt;td&gt;3%&lt;/td&gt; 
   &lt;td&gt;24%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;24%&lt;/td&gt; 
   &lt;td&gt;16%&lt;/td&gt; 
   &lt;td&gt;8%&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;64&lt;/td&gt; 
   &lt;td&gt;7%&lt;/td&gt; 
   &lt;td&gt;3%&lt;/td&gt; 
   &lt;td&gt;24%&lt;/td&gt; 
   &lt;td&gt;6%&lt;/td&gt; 
   &lt;td&gt;4%&lt;/td&gt; 
   &lt;td&gt;6%&lt;/td&gt; 
   &lt;td&gt;26%&lt;/td&gt; 
   &lt;td&gt;16%&lt;/td&gt; 
   &lt;td&gt;9%&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;**128&lt;/td&gt; 
   &lt;td&gt;1%&lt;/td&gt; 
   &lt;td&gt;4%&lt;/td&gt; 
   &lt;td&gt;19%&lt;/td&gt; 
   &lt;td&gt;8%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;7%&lt;/td&gt; 
   &lt;td&gt;14%&lt;/td&gt; 
   &lt;td&gt;20%&lt;/td&gt; 
   &lt;td&gt;22%&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td rowspan="5"&gt;14 days&lt;/td&gt; 
   &lt;td&gt;8&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;16&lt;/td&gt; 
   &lt;td&gt;10%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;24%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;24%&lt;/td&gt; 
   &lt;td&gt;14%&lt;/td&gt; 
   &lt;td&gt;10%&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;32&lt;/td&gt; 
   &lt;td&gt;8%&lt;/td&gt; 
   &lt;td&gt;3%&lt;/td&gt; 
   &lt;td&gt;24%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;24%&lt;/td&gt; 
   &lt;td&gt;16%&lt;/td&gt; 
   &lt;td&gt;8%&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;64&lt;/td&gt; 
   &lt;td&gt;7%&lt;/td&gt; 
   &lt;td&gt;3%&lt;/td&gt; 
   &lt;td&gt;24%&lt;/td&gt; 
   &lt;td&gt;6%&lt;/td&gt; 
   &lt;td&gt;4%&lt;/td&gt; 
   &lt;td&gt;6%&lt;/td&gt; 
   &lt;td&gt;26%&lt;/td&gt; 
   &lt;td&gt;16%&lt;/td&gt; 
   &lt;td&gt;9%&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;128&lt;/td&gt; 
   &lt;td&gt;1%&lt;/td&gt; 
   &lt;td&gt;4%&lt;/td&gt; 
   &lt;td&gt;19%&lt;/td&gt; 
   &lt;td&gt;8%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;7%&lt;/td&gt; 
   &lt;td&gt;14%&lt;/td&gt; 
   &lt;td&gt;20%&lt;/td&gt; 
   &lt;td&gt;22%&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td rowspan="5"&gt;30 days&lt;/td&gt; 
   &lt;td&gt;8&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
   &lt;td&gt;11%&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;16&lt;/td&gt; 
   &lt;td&gt;10%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;24%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;24%&lt;/td&gt; 
   &lt;td&gt;14%&lt;/td&gt; 
   &lt;td&gt;10%&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;32&lt;/td&gt; 
   &lt;td&gt;8%&lt;/td&gt; 
   &lt;td&gt;3%&lt;/td&gt; 
   &lt;td&gt;24%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;24%&lt;/td&gt; 
   &lt;td&gt;16%&lt;/td&gt; 
   &lt;td&gt;8%&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;64&lt;/td&gt; 
   &lt;td&gt;7%&lt;/td&gt; 
   &lt;td&gt;3%&lt;/td&gt; 
   &lt;td&gt;24%&lt;/td&gt; 
   &lt;td&gt;6%&lt;/td&gt; 
   &lt;td&gt;4%&lt;/td&gt; 
   &lt;td&gt;6%&lt;/td&gt; 
   &lt;td&gt;26%&lt;/td&gt; 
   &lt;td&gt;16%&lt;/td&gt; 
   &lt;td&gt;9%&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;128&lt;/td&gt; 
   &lt;td&gt;1%&lt;/td&gt; 
   &lt;td&gt;4%&lt;/td&gt; 
   &lt;td&gt;19%&lt;/td&gt; 
   &lt;td&gt;8%&lt;/td&gt; 
   &lt;td&gt;5%&lt;/td&gt; 
   &lt;td&gt;7%&lt;/td&gt; 
   &lt;td&gt;14%&lt;/td&gt; 
   &lt;td&gt;20%&lt;/td&gt; 
   &lt;td&gt;22%&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;For example, in the *7 days concurrent run at **128 concurrency, the proposed configuration distributes the 128 submitted executions across Query 1 to Query 9 according to their weights, so that Query 9 (22%) is executed the greatest number of times.&lt;/p&gt; 
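&lt;p&gt;As a worked example of this arithmetic, the following Python sketch converts the weights of the 128-concurrency row into whole execution counts. The largest-remainder rounding used here is an assumption chosen so that the counts add up exactly to 128. It is not necessarily how JMeter or the original test plans distribute submissions.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;def allocate_executions(weights, total_executions):
    """Split total_executions across queries in proportion to integer weights.

    Uses the largest-remainder method so the counts always add up exactly.
    """
    weight_sum = sum(weights.values())
    counts = {}
    remainders = {}
    for name, weight in weights.items():
        counts[name], remainders[name] = divmod(weight * total_executions, weight_sum)
    leftover = total_executions - sum(counts.values())
    # Hand the unassigned executions to the queries with the largest remainders
    for name in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[name] += 1
    return counts


# Weights (percent) from the 128-concurrency row of the concurrent test plan table
weights_128 = {
    "Query 1": 1, "Query 2": 4, "Query 3": 19, "Query 4": 8, "Query 5": 5,
    "Query 6": 7, "Query 7": 14, "Query 8": 20, "Query 9": 22,
}
print(allocate_executions(weights_128, 128))&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;With these weights, Query 9 receives 28 of the 128 executions, the largest share. Because the function divides by the sum of the weights, it also handles rows whose percentages do not add up to exactly 100.&lt;/p&gt; 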
&lt;h2&gt;JMeter setup&lt;/h2&gt; 
&lt;p&gt;To begin, you must set up JMeter on a machine that can handle the desired test load. An EC2 instance is a flexible and cost-effective option. Choose an instance type with sufficient vCPUs to support your maximum planned concurrency. For example, a c6i.4xlarge or higher is typically suitable for moderate to high throughput testing scenarios. For the operating system, you might choose Amazon Linux, which is optimized for AWS. For production-grade testing environments, deploy the JMeter EC2 instance in a private subnet of a virtual private cloud (VPC) with appropriate security groups that allow only required connections. This network isolation helps maintain security while executing performance tests. Consider using &lt;a href="https://aws.amazon.com/vpc" target="_blank" rel="noopener noreferrer"&gt;Amazon Virtual Private Cloud&lt;/a&gt; (Amazon VPC) endpoints for secure access to AWS services.&lt;/p&gt; 
&lt;p&gt;After the instance is provisioned, install Java (Java 17 LTS or Java 21 LTS) and download the latest version of JMeter. Be sure to configure the system with appropriate JVM options to allocate sufficient heap memory for large-scale test executions. Refer to &lt;a href="https://jmeter.apache.org/usermanual/get-started.html" target="_blank" rel="noopener noreferrer"&gt;Getting Started&lt;/a&gt; to learn more.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;# Install Java (Amazon Corretto 17) on Amazon Linux
sudo yum update -y
sudo yum install java-17-amazon-corretto -y

# Download and extract JMeter
wget https://downloads.apache.org/jmeter/binaries/apache-jmeter-5.6.3.tgz
tar -xvzf apache-jmeter-5.6.3.tgz
cd apache-jmeter-5.6.3

# Place the JDBC driver JAR for the engine you selected in the lib folder
# (the file name below is a placeholder)
# cp /path/to/your-jdbc-driver.jar lib/

# Launch JMeter in GUI mode (if using a GUI-capable setup) or use CLI for remote testing
./bin/jmeter&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h2&gt;JMeter concepts&lt;/h2&gt; 
&lt;p&gt;Before you create test plans in JMeter, it’s important to understand a few foundational concepts that influence how your test plan behaves, such as thread groups, user-defined variables, and JDBC connections. These components enable the simulation of real-world query loads, including concurrency and pacing.&lt;/p&gt; 
&lt;h3&gt;Test plans&lt;/h3&gt; 
&lt;p&gt;The test plan is the top-level container for a JMeter test. It defines the overall testing strategy, including the queries to execute, their parameters, and the concurrent user behavior. Test plans are saved as &lt;code&gt;.jmx&lt;/code&gt; files, which you can then use for CLI-based execution. JMeter supports both GUI and CLI modes. We strongly recommend using the JMeter GUI primarily for creating test plans and the CLI for running large load tests. By default, JMeter runs all thread groups in parallel, which suits concurrent execution. You can also configure the test plan to run thread groups consecutively for sequential execution. Refer to &lt;a href="https://jmeter.apache.org/usermanual/build-test-plan.html" target="_blank" rel="noopener noreferrer"&gt;Building a Test Plan&lt;/a&gt; to learn more about options available with test plans.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignleft wp-image-83931 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/09/30/image-1-29.png" alt="" width="1532" height="750"&gt;&lt;/p&gt; 
&lt;h3&gt;User-defined variables&lt;/h3&gt; 
&lt;p&gt;User-defined variables are global parameters that you can reuse throughout the test plan. They are helpful for defining database credentials, server URLs, or query parameters. For example: &lt;code&gt;DB_URL=jdbc:trino://trino-cluster.example.com:8889?SSL=true&lt;/code&gt; (the &lt;code&gt;SSL=true&lt;/code&gt; parameter enables SSL/TLS).&lt;/p&gt; 
&lt;p&gt;You can configure authentication (user name and password) through your organization’s approved methods, such as &lt;a href="https://aws.amazon.com/secrets-manager/" target="_blank" rel="noopener noreferrer"&gt;AWS Secrets Manager&lt;/a&gt; (see &lt;a href="https://docs.aws.amazon.com/secretsmanager/latest/userguide/hardcoded.html" target="_blank" rel="noopener noreferrer"&gt;Move hardcoded secrets to AWS Secrets Manager&lt;/a&gt;), &lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener noreferrer"&gt;AWS Identity and Access Management&lt;/a&gt; (IAM) roles, or other secure credential management systems.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignleft wp-image-83932 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/09/30/image-2-23.png" alt="" width="1535" height="414"&gt;&lt;/p&gt; 
&lt;h3&gt;Thread groups&lt;/h3&gt; 
&lt;p&gt;A thread group represents a group of virtual users (threads) executing test actions. Each thread simulates a single user sending requests to the SQL engine, so you can use thread groups to simulate concurrent runs. For example, in the preceding template, Query 3 has a 19% weight across 128 runs. That works out to 0.19 × 128 = 24.32, which we round up to 25 runs, so we set the number of threads in the thread group to 25.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignleft wp-image-83933 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/09/30/image-3-25.png" alt="" width="1530" height="522"&gt;&lt;/p&gt; 
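&lt;p&gt;The weight-to-thread arithmetic above can be sketched in a few lines of Python. This is a minimal illustration and not part of JMeter: the function name &lt;code&gt;threads_for_weight&lt;/code&gt; and the choice to round up are assumptions, picked so that the 19% example yields 25 threads.&lt;/p&gt;

```python
def threads_for_weight(weight_percent, total_runs):
    """Return the thread count for one query, rounded up to a whole thread."""
    # Integer ceiling division avoids floating-point rounding surprises.
    return (weight_percent * total_runs + 99) // 100


# Query 3 from the example: 19% of 128 runs.
print(threads_for_weight(19, 128))
```

&lt;p&gt;Because each count is rounded up independently, the per-query thread counts can add up to slightly more than the planned total. If you need an exact total, reduce one of the groups by hand.&lt;/p&gt;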
&lt;h3&gt;JDBC connection configuration&lt;/h3&gt; 
&lt;p&gt;JDBC connection configuration sets up the database connection for the test. It specifies the database URL, driver, and credentials required for executing SQL queries. Key fields to configure are database URL and JDBC driver class. The following table summarizes the different configuration settings.&lt;/p&gt; 
&lt;p&gt;&amp;nbsp;&lt;/p&gt; 
&lt;table class="styled-table" border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;SQL Engine&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;JDBC Driver&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;JDBC Driver Class&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Trino on EMR&lt;/td&gt; 
   &lt;td&gt;&lt;code&gt;trino-jdbc-&amp;lt;trino_version&amp;gt;-amzn-0.jar&lt;/code&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;code&gt;io.trino.jdbc.TrinoDriver&lt;/code&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Athena&lt;/td&gt; 
   &lt;td&gt;&lt;a href="https://docs.aws.amazon.com/athena/latest/ug/jdbc-v3-driver.html" target="_blank" rel="noopener noreferrer"&gt;Athena JDBC 3.x driver&lt;/a&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;code&gt;com.amazon.athena.jdbc.AthenaDriver&lt;/code&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Amazon Redshift&lt;/td&gt; 
   &lt;td&gt;&lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/jdbc20-download-driver.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Redshift JDBC driver&lt;/a&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;code&gt;com.amazon.redshift.jdbc.Driver&lt;/code&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Trino on EKS&lt;/td&gt; 
   &lt;td&gt;&lt;a href="https://repo1.maven.org/maven2/io/trino/trino-jdbc/" target="_blank" rel="noopener noreferrer"&gt;Trino JDBC driver&lt;/a&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;code&gt;io.trino.jdbc.TrinoDriver&lt;/code&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignleft wp-image-83934 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/09/30/image-4-15.png" alt="" width="1534" height="833"&gt;&lt;/p&gt; 
&lt;h3&gt;JDBC requests&lt;/h3&gt; 
&lt;p&gt;The JDBC request executes SQL queries against the database using the configuration defined in the JDBC connection configuration.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignleft wp-image-83935 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/09/30/image-5-14.png" alt="" width="1529" height="829"&gt;&lt;/p&gt; 
&lt;p&gt;For example, the following command runs JMeter in CLI mode:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Run benchmarks in CLI mode 
./jmeter -n -t &amp;lt;path_to&amp;gt;.jmx -l &amp;lt;local path for log&amp;gt;.log -e -o &amp;lt;local path for&amp;gt;/output/&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The output folder will contain an HTML report with different statistics. The following screenshot illustrates 128 concurrent runs.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignleft wp-image-83936 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/09/30/image-6-13.png" alt="" width="1805" height="479"&gt;&lt;/p&gt; 
&lt;h2&gt;Monitoring and logging&lt;/h2&gt; 
&lt;p&gt;For comprehensive visibility and audit requirements, enable &lt;a href="https://aws.amazon.com/cloudtrail" target="_blank" rel="noopener noreferrer"&gt;AWS CloudTrail&lt;/a&gt; logging, &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html" target="_blank" rel="noopener noreferrer"&gt;VPC Flow Logs&lt;/a&gt;, and service-specific logs (like Amazon S3 access logs). These logs can be centralized in &lt;a href="https://aws.amazon.com/cloudwatch" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch Logs&lt;/a&gt; for monitoring and analysis. This provides proper audit trails while evaluating different SQL engines and helps track access patterns and potential security events.&lt;/p&gt; 
&lt;h2&gt;Post-test steps&lt;/h2&gt; 
&lt;p&gt;After running your JMeter tests, proceed with the following steps:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Review the HTML report’s key metrics, including response times, throughput, and error rates across different query types and concurrency levels.&lt;/li&gt; 
 &lt;li&gt;Run identical test plans across your candidate SQL engines for direct performance comparison.&lt;/li&gt; 
 &lt;li&gt;Refine your test plans based on initial findings, focusing on areas where performance differences are significant.&lt;/li&gt; 
 &lt;li&gt;Factor in the cost implications alongside performance metrics to make a balanced decision.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;These steps can help you systematically evaluate and select the most suitable SQL engine for your analytics workloads.&lt;/p&gt; 
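&lt;p&gt;When you compare engines, it can help to reduce each run’s results file to the same few numbers. The following sketch assumes the file written by the &lt;code&gt;-l&lt;/code&gt; option is in CSV format with a header row that includes the &lt;code&gt;label&lt;/code&gt;, &lt;code&gt;elapsed&lt;/code&gt;, and &lt;code&gt;success&lt;/code&gt; columns. Check your JMeter results settings if your file differs. The function name &lt;code&gt;summarize_jtl&lt;/code&gt; and the sample rows are hypothetical.&lt;/p&gt;

```python
import csv
import io
from collections import defaultdict


def summarize_jtl(csv_file):
    """Return {label: (samples, avg_elapsed_ms, error_rate)} from a JMeter CSV results file."""
    elapsed = defaultdict(list)
    errors = defaultdict(int)
    for row in csv.DictReader(csv_file):
        label = row["label"]
        elapsed[label].append(int(row["elapsed"]))
        if row["success"].strip().lower() != "true":
            errors[label] += 1
    return {
        label: (len(times), sum(times) / len(times), errors[label] / len(times))
        for label, times in elapsed.items()
    }


# Tiny inline sample standing in for a real results file (values are made up).
sample = io.StringIO(
    "timeStamp,elapsed,label,responseCode,success\n"
    "1700000000000,1200,Query 3,200,true\n"
    "1700000000500,1800,Query 3,200,true\n"
    "1700000001000,900,Query 7,500,false\n"
)
for label, stats in summarize_jtl(sample).items():
    print(label, stats)
```

&lt;p&gt;To use it on a real run, open the results file with &lt;code&gt;open(path, newline="")&lt;/code&gt; and pass the file object to the function. Running it once per engine gives you summaries that you can compare side by side.&lt;/p&gt;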
&lt;h2&gt;Resources&lt;/h2&gt; 
&lt;p&gt;In the preceding steps, we walked through a UI-based setup for JMeter along with test plans. We have created a few sample JMeter test plans for both sequential and concurrent runs along with sample test reports. You can modify the plans to fit your needs.&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;JMeter &lt;a href="https://d2908q01vomqb2.cloudfront.net/artifacts/DBSBlogs/BDB-4497/index.html"&gt;sample report&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;JMeter &lt;a href="https://d2908q01vomqb2.cloudfront.net/artifacts/DBSBlogs/BDB-4497/sample_trino_30d_single_thread.jmx"&gt;test plan&lt;/a&gt; for sequential run&lt;/li&gt; 
 &lt;li&gt;JMeter &lt;a href="https://d2908q01vomqb2.cloudfront.net/artifacts/DBSBlogs/BDB-4497/sample_trino_7d_128parallel_queries.jmx"&gt;test plan&lt;/a&gt; for concurrent run&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Clean up&lt;/h2&gt; 
&lt;p&gt;After you complete your benchmarking process, clean up the resources to avoid unnecessary costs:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Stop or delete the EC2 instances used for running JMeter.&lt;/li&gt; 
 &lt;li&gt;Depending on which SQL engines you used for testing, clean up active resources.&lt;/li&gt; 
 &lt;li&gt;Review your AWS Management Console to confirm no active resources remain.&lt;/li&gt; 
 &lt;li&gt;If you created test datasets in Amazon S3 or other storage services specifically for this benchmarking, consider deleting them if they’re no longer needed.&lt;/li&gt; 
 &lt;li&gt;Although JMeter test plans and results don’t incur AWS costs, organize or delete local files as needed for your record-keeping.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Summary&lt;/h2&gt; 
&lt;p&gt;Selecting the right SQL processing solution for large-scale analytics demands a systematic, data-driven approach. Our JMeter framework can help organizations effectively evaluate different SQL engines by simulating real-world workload patterns across various query types, data volumes, and concurrency levels. This methodology reduces proof-of-concept cycles and provides insights beyond traditional benchmarks, helping you assess managed AWS services like Athena and Amazon Redshift and open source solutions on Amazon EKS.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h3&gt;About the authors&lt;/h3&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/09/30/pic_headshot.jpeg" alt="Anubhav Awasthi" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Anubhav Awasthi&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/anubhavawasthi/" target="_blank" rel="noopener"&gt;Anubhav&lt;/a&gt; is a Senior Big Data Specialist Solutions Architect at Amazon Web Services (AWS). He collaborates with customers to provide expert architectural guidance for implementing and optimizing analytics solutions using Amazon EMR, Amazon Athena, AWS Glue, and AWS Lake Formation.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/09/30/brahmi-badge-high-resolution-100x133.jpeg" alt="Gagan Brahmi" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Gagan Brahmi&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/gaganbrahmi/" target="_blank" rel="noopener"&gt;Gagan&lt;/a&gt; is a Specialist Senior Solutions Architect at Amazon Web Services (AWS), focused on Data Analytics and AI/ML. With over 20 years in information technology, he partners with customers to solve complex AI/ML challenges by leveraging data and AI/ML platforms. Gagan helps customers architect scalable, high-performance solutions that utilize distributed data processing, real-time streaming technologies, and AI/ML services to drive business transformation through artificial intelligence and data-driven insights. When not designing cloud-native data and AI solutions, Gagan enjoys exploring new places with his family.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/09/30/JP-Profile-Pic-1-100x98.jpg" alt="Jayaprakash Boreddy" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Jayaprakash Boreddy&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/jayaprakash-boreddy/" target="_blank" rel="noopener"&gt;Jayaprakash&lt;/a&gt; is a Senior Solutions Architect at AWS. He works with ISV customers in designing and building highly scalable, flexible and resilient applications on AWS Cloud.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/09/30/Picture1-39-100x131.png" alt="Sahil Thapar" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Sahil Thapar&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/sahil-thapar-71657b77/" target="_blank" rel="noopener"&gt;Sahil&lt;/a&gt; is a Principal Solutions Architect. He works with ISV customers to help them build highly available, scalable, and resilient applications on the AWS Cloud.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Build petabyte-scale synthetic test data with Amazon EMR on EC2</title>
		<link>https://aws.amazon.com/blogs/big-data/build-petabyte-scale-synthetic-test-data-with-amazon-emr-on-ec2/</link>
		
		<dc:creator><![CDATA[Anubhav Awasthi]]></dc:creator>
		<pubDate>Tue, 19 May 2026 15:42:49 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon EMR]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">52451d2b23fa16a52e8cca4cc77adab889dfb81a</guid>

					<description>As data volumes grow from terabytes to petabytes, the architecture for generating synthetic data must evolve to meet increasing demands for scale, performance, and data quality. In this post, we show how you can build a scalable synthetic data generation solution using Amazon EMR, Apache Spark, and the Faker library.</description>
										<content:encoded>&lt;p&gt;As you scale your data systems, you face a challenge: how to test thoroughly without putting customer data at risk. Using production data for testing can expose sensitive customer information to unauthorized access or breaches. For customers in regulated industries like finance and healthcare, this risk isn’t only a concern. It’s unacceptable. A data breach during testing could compromise their privacy, damage their trust, and expose organizations to significant compliance penalties. Synthetic test data solves this problem by generating artificial datasets that replicate the structure and patterns of real data without containing any actual customer information. This approach means you can test performance, validate data pipelines, and develop new features while ensuring that customer data remains protected and compliance requirements are met.&lt;/p&gt; 
&lt;p&gt;As data volumes grow from terabytes to petabytes, the architecture for generating synthetic data must evolve to meet increasing demands for scale, performance, and data quality. In this post, we show how you can build a scalable synthetic data generation solution using &lt;a href="https://aws.amazon.com/emr/" target="_blank" rel="noopener"&gt;Amazon EMR&lt;/a&gt;, &lt;a href="https://spark.apache.org/" target="_blank" rel="noopener"&gt;Apache Spark&lt;/a&gt;, and the &lt;a href="https://faker.readthedocs.io/" target="_blank" rel="noopener"&gt;Faker library&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;The challenge of synthetic data generation&lt;/h2&gt; 
&lt;p&gt;Traditional benchmark datasets like &lt;a href="https://www.tpc.org/TPC_Documents_Current_Versions/pdf/TPC-DS_v4.0.0.pdf" target="_blank" rel="noopener"&gt;TPC-DS&lt;/a&gt; provide standardized schemas and predetermined data volumes for consistent testing environments across different systems. However, they fall short in meeting real-world testing requirements. These benchmarks don’t capture &lt;strong&gt;industry-specific patterns or the complex relationships&lt;/strong&gt; found in actual production data. Their &lt;strong&gt;rigid schemas and simplified distributions&lt;/strong&gt; fail to reflect business requirements, and &lt;strong&gt;scaling&lt;/strong&gt; them while maintaining data consistency proves difficult. Perhaps most critically, generating massive datasets with traditional approaches requires specialized architectures to avoid proportional increases in &lt;strong&gt;compute costs and time&lt;/strong&gt;.&lt;/p&gt; 
&lt;h2&gt;Requirements for production-grade synthetic data&lt;/h2&gt; 
&lt;p&gt;Effective workload validation demands synthetic data that &lt;strong&gt;mirrors production distributions&lt;/strong&gt; while &lt;strong&gt;maintaining referential integrity&lt;/strong&gt; across related tables and entities. The generation process must &lt;strong&gt;scale horizontally&lt;/strong&gt; to accommodate growing data volumes while &lt;strong&gt;delivering deterministic results&lt;/strong&gt;. Given identical input parameters, the system should produce the same dataset across multiple runs, supporting consistent testing cycles and comparative analysis.&lt;/p&gt; 
&lt;p&gt;Beyond technical requirements, synthetic data addresses &lt;strong&gt;compliance&lt;/strong&gt; needs by minimizing exposure of &lt;em&gt;personally identifiable information (PII)&lt;/em&gt; and &lt;em&gt;protected health information (PHI)&lt;/em&gt; in non-production environments. This approach satisfies &lt;em&gt;GDPR, HIPAA, and CCPA&lt;/em&gt; requirements while supporting secure cross-border data transfer, regular stress testing without compromising sensitive information, and providing an audit-friendly alternative to data masking that preserves analytical properties.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;Architecting a synthetic data generation system that scales from terabytes to petabytes requires balancing several competing demands: the system must scale horizontally while maintaining data quality, generate large volumes efficiently, manage compute and storage resources cost-effectively, and support various schemas and output formats.&lt;/p&gt; 
&lt;p&gt;Our architecture addresses these challenges through four core components. &lt;strong&gt;Apache Spark on Amazon EMR&lt;/strong&gt; provides the distributed computing framework necessary for large-scale generation. The &lt;strong&gt;Faker library&lt;/strong&gt; offers synthetic data generation functions that integrate with Spark. &lt;strong&gt;&lt;a href="http://aws.amazon.com/s3" target="_blank" rel="noopener"&gt;Amazon Simple Storage Service (Amazon S3)&lt;/a&gt; with Apache Iceberg&lt;/strong&gt; serves as the storage layer. We chose Iceberg for its schema and partition evolution capabilities without data rewrites, atomic transactions for consistency, precise time travel features for reproducible testing, and optimized performance at extreme scale. Amazon EMR handles dynamic &lt;strong&gt;resource allocation and cluster management&lt;/strong&gt;.&lt;/p&gt; 
&lt;p&gt;The following diagram illustrates the solution architecture.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/13/BDB-4496-1.png" alt="Solution architecture showing Amazon EMR generating synthetic data with Apache Spark and Faker, storing results in Amazon S3 with Apache Iceberg" width="600"&gt;&lt;/p&gt; 
&lt;h2&gt;Synthetic data generation at scale with Amazon EMR&lt;/h2&gt; 
&lt;p&gt;Amazon EMR emerges as a particularly powerful solution for this use case, offering several advantages that directly address our requirements. It facilitates scaling of compute resources through instance fleets and Spot Instances, which can reduce costs by up to 90% compared to On-Demand pricing. The service provides built-in performance optimization for Spark applications with real-time monitoring through &lt;a href="http://aws.amazon.com/cloudwatch" target="_blank" rel="noopener"&gt;Amazon CloudWatch&lt;/a&gt; integration.&lt;/p&gt; 
&lt;p&gt;The managed infrastructure reduces operational overhead by handling the underlying Spark ecosystem and cluster lifecycle, while still providing control over scaling policies, instance types, and configurations. Integration with Amazon S3, &lt;a href="https://aws.amazon.com/glue" target="_blank" rel="noopener"&gt;AWS Glue&lt;/a&gt;, and &lt;a href="http://aws.amazon.com/athena" target="_blank" rel="noopener"&gt;Amazon Athena&lt;/a&gt; facilitates end-to-end data generation and testing workflows. Support for multiple programming languages and notebooks provides flexibility in implementing generation logic tailored to specific testing scenarios.&lt;/p&gt; 
&lt;p&gt;The synthetic data generation process follows a systematic approach designed for efficiency and scalability, as illustrated in the following diagram.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/13/BDB-4496-2.png" alt="Synthetic data generation workflow showing the systematic process from configuration through data generation to storage" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;Although synthetic data generation isn’t a sensitive workload, it’s important to maintain robust security throughout the data generation process. Amazon EMR provides security features that align with organizational compliance requirements.&lt;/p&gt; 
&lt;p&gt;For comprehensive security guidance specific to Amazon EMR deployments, refer to &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-security.html" target="_blank" rel="noopener"&gt;Security in Amazon EMR&lt;/a&gt;. The solution follows the AWS Shared Responsibility Model, where AWS manages the security of the cloud infrastructure, and customers maintain responsibility for data security, access management, and compliance controls in the cloud. Specifically for synthetic data generation workloads, AWS manages the security of the underlying Amazon EMR infrastructure, network, and service operations, and customers implement appropriate security controls for their data generation pipelines. Consider the following key areas:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Data protection&lt;/strong&gt; – Enable encryption at rest and in transit using Amazon EMR security configurations, including Amazon S3 encryption and TLS certificates for inter-node communication.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Network security&lt;/strong&gt; – Deploy Amazon EMR clusters in private subnets with security groups following least privilege, and enable the Amazon EMR block public access feature.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Access control&lt;/strong&gt; – Implement &lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener"&gt;AWS Identity and Access Management&lt;/a&gt; (IAM) roles with least privilege for Amazon EMR service roles, &lt;a href="http://aws.amazon.com/ec2" target="_blank" rel="noopener"&gt;Amazon Elastic Compute Cloud&lt;/a&gt; (Amazon EC2) instance profiles, and runtime roles to isolate job access. Fine-grained table-level and column-level permissions can be controlled using &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-lake-formation.html" target="_blank" rel="noopener"&gt;AWS Lake Formation&lt;/a&gt;. Additional authentication options are available using Kerberos and LDAP.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Optimize Faker for petabyte-scale data generation&lt;/h2&gt; 
&lt;p&gt;When generating synthetic data at petabyte scale, using Faker’s default implementations can quickly lead to performance bottlenecks. To overcome these limitations, combine several optimization approaches instead of relying on the default setup. This section discusses the approaches we adopted in this scenario.&lt;/p&gt; 
&lt;h3&gt;Faker instance pooling&lt;/h3&gt; 
&lt;p&gt;The following code creates multiple Faker instances to avoid contention when generating data in parallel:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;from faker import Faker

NUM_FAKER_INSTANCES = 10
faker_pool = [Faker() for _ in range(NUM_FAKER_INSTANCES)]&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3&gt;Consistent seed management&lt;/h3&gt; 
&lt;p&gt;The following code provides reproducible data generation across distributed executors:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
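 &lt;pre&gt;&lt;code class="language-python"&gt;import random

random.seed(42)  # Seed the module-level RNG once, for reproducible pool selection
for i, faker in enumerate(faker_pool):
    # Distinct but deterministic seed per instance, so instances don't emit identical sequences
    faker.seed_instance(42 + i)&lt;/code&gt;&lt;/pre&gt; 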
&lt;/div&gt; 
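&lt;p&gt;The effect of the seed choice is easy to check with the standard library. The following sketch uses &lt;code&gt;random.Random&lt;/code&gt; as a stand-in for a Faker instance, an assumption made so that the example runs without Faker installed. Generators that share one seed emit identical sequences, whereas deriving each seed from a base value plus the pool index keeps runs reproducible while giving every instance its own stream. The names &lt;code&gt;BASE_SEED&lt;/code&gt;, &lt;code&gt;POOL_SIZE&lt;/code&gt;, and &lt;code&gt;first_values&lt;/code&gt; are hypothetical.&lt;/p&gt;

```python
import random

BASE_SEED = 42
POOL_SIZE = 3


def first_values(seed, count=3):
    """Return the first few draws from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randint(0, 999) for _ in range(count)]


# Same seed everywhere: every pool member produces the same sequence.
shared = [first_values(BASE_SEED) for _ in range(POOL_SIZE)]
print(shared[0] == shared[1] == shared[2])

# Base seed plus index: repeatable across runs, one stream per pool member.
distinct = [first_values(BASE_SEED + i) for i in range(POOL_SIZE)]
print(distinct == [first_values(BASE_SEED + i) for i in range(POOL_SIZE)])
```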
&lt;h3&gt;Random access to Faker pool&lt;/h3&gt; 
&lt;p&gt;The following code distributes load across multiple Faker instances to reduce contention:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;faker = faker_pool[random.randint(0, NUM_FAKER_INSTANCES-1)]&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3&gt;Broadcast variables for reference data&lt;/h3&gt; 
&lt;p&gt;The following code efficiently distributes reference data to all executors:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;tenant_ids_broadcast = spark.sparkContext.broadcast(tenant_ids)
protocols_bc = spark.sparkContext.broadcast(protocols)&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3&gt;Batch generation of synthetic data&lt;/h3&gt; 
&lt;p&gt;The following code generates fake data in batches rather than one-by-one:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;return (
    spark.range(1, num_endpoints + 1)
    .withColumn("hostname", random_hostname_udf())
)&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3&gt;ThreadPoolExecutor for parallel processing&lt;/h3&gt; 
&lt;p&gt;The following code uses Python’s threading for parallel operations within executors:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;from concurrent.futures import ThreadPoolExecutor

def parallel_write_with_sync(dataframe_configs, max_workers=3):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Parallel processing
        ...&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
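&lt;p&gt;One way the body of such a function might look is sketched below. The write step is represented by a hypothetical &lt;code&gt;write_one&lt;/code&gt; callable supplied by the caller, and each entry of &lt;code&gt;dataframe_configs&lt;/code&gt; is assumed to be a dictionary with a &lt;code&gt;name&lt;/code&gt; key. Neither detail comes from the original code, so treat this as an illustration of the pattern and not as the actual implementation.&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def parallel_write_with_sync(dataframe_configs, write_one, max_workers=3):
    """Run write_one(config) for every config in parallel and wait for all to finish."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(write_one, cfg): cfg["name"] for cfg in dataframe_configs}
        for future in as_completed(futures):
            # result() re-raises any exception raised in the worker thread.
            results[futures[future]] = future.result()
    return results


def fake_write(cfg):
    # Hypothetical stand-in for a DataFrame write; returns the row count it "wrote".
    return cfg["rows"]


configs = [{"name": "endpoints", "rows": 100}, {"name": "events", "rows": 5000}]
print(parallel_write_with_sync(configs, fake_write))
```

&lt;p&gt;The &lt;code&gt;with&lt;/code&gt; block exits only after every submitted task has finished, which provides the synchronization point that the function name suggests.&lt;/p&gt;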
&lt;h2&gt;Optimize Amazon EMR and Spark&lt;/h2&gt; 
&lt;p&gt;When processing massive datasets with Spark on Amazon EMR, carefully tuning configurations can substantially enhance performance beyond the standard settings. In this section, we discuss ways to optimize the execution environment, so you can efficiently handle petabyte-scale workloads with synthetic data generation. By strategically using Spark’s advanced features and configuring Amazon EMR for your specific use case, you can improve throughput, reduce processing time, and maximize resource utilization.&lt;/p&gt; 
&lt;h3&gt;Arrow configuration&lt;/h3&gt; 
&lt;p&gt;The following code enables Apache Arrow for efficient data transfer between Python and JVM. The default value is false.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;.config("spark.sql.execution.arrow.pyspark.enabled", "true")&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Enable this configuration when your PySpark application frequently converts data between Python and JVM, especially for large DataFrames or when using Pandas operations. Keep this setting disabled for pure Spark SQL workloads or when memory is constrained.&lt;/p&gt; 
&lt;p&gt;This optimization is most effective in the following scenarios:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;When processing large-scale datasets that require frequent conversion between Python and JVM.&lt;/li&gt; 
 &lt;li&gt;In a PySpark application where large DataFrame operations and Pandas integration are needed.&lt;/li&gt; 
 &lt;li&gt;With data science workloads that combine Python UDFs with Spark SQL operations.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Consider the following trade-offs:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Arrow maintains in-memory columnar format, resulting in increased memory consumption.&lt;/li&gt; 
 &lt;li&gt;Not all data types are fully supported in older versions of Spark.&lt;/li&gt; 
 &lt;li&gt;It might introduce overhead for very small datasets where conversion costs outweigh the benefits.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Adaptive query execution&lt;/h3&gt; 
&lt;p&gt;The following code allows Spark to dynamically optimize query execution plans. The default value is true in Spark 3.2 and later, and false in earlier versions.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;.config("spark.sql.adaptive.enabled", "true")&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;We generally recommend keeping this optimization enabled for most workloads. Consider disabling it only when you have highly optimized, predictable queries where the adaptive overhead isn’t beneficial, or when troubleshooting query performance issues.&lt;/p&gt; 
&lt;p&gt;This optimization is most effective in the following scenarios:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Complex join operations with unknown or skewed data distributions.&lt;/li&gt; 
 &lt;li&gt;Multi-stage queries where initial plans might be suboptimal.&lt;/li&gt; 
 &lt;li&gt;When processing data with changing characteristics over time.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Consider the following trade-offs:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;You may experience additional overhead during the query planning phase.&lt;/li&gt; 
 &lt;li&gt;Spark might occasionally choose suboptimal plans for certain edge cases.&lt;/li&gt; 
&lt;/ul&gt; 
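&lt;p&gt;The &lt;code&gt;.config(...)&lt;/code&gt; fragments shown so far attach to a &lt;code&gt;SparkSession&lt;/code&gt; builder. The following sketch shows one way to keep such settings in a single dictionary and apply them in a loop. The application name and the helper name &lt;code&gt;build_session&lt;/code&gt; are hypothetical, and the function needs PySpark available at call time.&lt;/p&gt;

```python
# Settings discussed so far, kept in one place so they are easy to review.
SPARK_CONF = {
    "spark.sql.execution.arrow.pyspark.enabled": "true",
    "spark.sql.adaptive.enabled": "true",
}


def build_session(app_name="synthetic-data-gen"):
    """Create a SparkSession with SPARK_CONF applied (requires PySpark)."""
    # Imported lazily so the settings can be inspected without Spark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in SPARK_CONF.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

&lt;p&gt;Keeping the settings in a dictionary makes it straightforward to log the exact configuration alongside each generation run.&lt;/p&gt;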
&lt;h3&gt;Parallelism configuration&lt;/h3&gt; 
&lt;p&gt;The following code sets appropriate parallelism for distributed data processing based on the volume of data you’re generating. The default value for &lt;code&gt;spark.default.parallelism&lt;/code&gt; is the total number of cores on all executor nodes or 2, whichever is larger. The default value for &lt;code&gt;spark.sql.shuffle.partitions&lt;/code&gt; is 200.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;.config("spark.default.parallelism", 1000)
.config("spark.sql.shuffle.partitions", 1000)&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Adjust this configuration when the default of 200 shuffle partitions produces too few, oversized tasks (increase the value for larger data volumes) or too many tiny tasks (decrease it for smaller datasets). Generally, aim for partition sizes of 100–200 MB. Modify &lt;code&gt;spark.default.parallelism&lt;/code&gt; when your RDD operations need different parallelism than the CPU-based default.&lt;/p&gt; 
&lt;p&gt;This optimization is most effective in the following scenarios:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;When generating consistent volumes of synthetic data across multiple runs.&lt;/li&gt; 
 &lt;li&gt;When you have predictable resource requirements.&lt;/li&gt; 
 &lt;li&gt;When you need to precisely control executor utilization.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Consider the following trade-offs:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Static configuration might not adapt well to varying data volumes.&lt;/li&gt; 
 &lt;li&gt;Too many partitions can lead to task scheduling overhead.&lt;/li&gt; 
 &lt;li&gt;Too few partitions might cause memory pressure on executors.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Memory management&lt;/h3&gt; 
&lt;p&gt;The following code optimizes memory allocation for execution and storage. The default value for &lt;code&gt;spark.memory.fraction&lt;/code&gt; is 0.6, and for &lt;code&gt;spark.memory.storageFraction&lt;/code&gt; is 0.5.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;.config("spark.memory.fraction", 0.8)
.config("spark.memory.storageFraction", 0.3)&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Increase &lt;code&gt;spark.memory.fraction&lt;/code&gt; from 0.6 to 0.8 when your workload is memory-intensive and you’re not using the JVM heap for other purposes. Adjust &lt;code&gt;spark.memory.storageFraction&lt;/code&gt; based on your caching vs.&amp;nbsp;execution memory needs. Decrease it to 0.3 if you do minimal caching but have complex computations, and increase it to 0.7 or higher for cache-heavy workloads.&lt;/p&gt; 
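&lt;p&gt;To see what these two fractions mean in megabytes, the following sketch applies Spark’s unified memory model: Spark sets aside 300 MB of the heap, the rest is multiplied by &lt;code&gt;spark.memory.fraction&lt;/code&gt;, and &lt;code&gt;spark.memory.storageFraction&lt;/code&gt; marks the share of that region in which cached blocks are protected from eviction. The 10 GB executor heap is a hypothetical value chosen for illustration.&lt;/p&gt;

```python
RESERVED_MB = 300  # heap that Spark sets aside before applying the fractions


def unified_memory_mb(heap_mb: float, memory_fraction: float, storage_fraction: float) -> dict:
    """Estimate the unified memory regions of one executor, in whole MB."""
    unified = (heap_mb - RESERVED_MB) * memory_fraction
    storage_protected = unified * storage_fraction
    return {
        "unified": round(unified),
        # Cached blocks inside this region are not evicted by execution.
        "storage_protected": round(storage_protected),
        # Execution can always claim at least the remainder.
        "execution_guaranteed": round(unified - storage_protected),
    }


heap_mb = 10 * 1024  # hypothetical 10 GB executor heap
print("defaults:", unified_memory_mb(heap_mb, 0.6, 0.5))
print("tuned:   ", unified_memory_mb(heap_mb, 0.8, 0.3))
```

&lt;p&gt;Under these assumptions, the tuned settings leave considerably more memory guaranteed to execution, at the cost of a smaller protected cache region and less heap for user data structures.&lt;/p&gt;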
&lt;p&gt;This optimization is most effective in the following scenarios:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Workloads that are memory-intensive and need fine-grained control.&lt;/li&gt; 
 &lt;li&gt;Workloads that balance between execution memory and cached data.&lt;/li&gt; 
 &lt;li&gt;During synthetic data generation that has many interdependent fields.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Consider the following trade-offs:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Incorrect memory configuration can lead to frequent spills to disk or out-of-memory (OOM) errors.&lt;/li&gt; 
 &lt;li&gt;You might need to change the configuration to suit different workload characteristics.&lt;/li&gt; 
 &lt;li&gt;The settings must be monitored and tuned for optimal performance.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Limited Python UDF usage&lt;/h3&gt; 
&lt;p&gt;The following code uses Spark’s built-in functions where possible instead of Python user-defined functions (UDFs). No additional configuration is needed. This is a coding practice.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;.withColumn("risk_score", F.round(F.rand() * 9 + 1, 2).cast(DecimalType(3, 2)))&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;We recommend using Spark functions over Python UDFs when the same functionality can be achieved. Use Python UDFs only when complex business logic can’t be expressed using Spark’s built-in functions, or when integrating with specialized Python libraries.&lt;/p&gt; 
&lt;p&gt;This optimization is most effective in the following scenarios:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Simple transformations that can be performed using Spark functions.&lt;/li&gt; 
 &lt;li&gt;High-throughput workloads where serialization overhead needs to be minimized.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Consider the following trade-offs:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;This approach is less flexible compared to custom Python-based transformations or functions.&lt;/li&gt; 
 &lt;li&gt;You might need to use complex expressions to accomplish certain data patterns.&lt;/li&gt; 
 &lt;li&gt;There is a potential learning curve to familiarize yourself with Spark functions.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;DataFrame caching&lt;/h3&gt; 
&lt;p&gt;The following code caches frequently used DataFrames to avoid regenerating data. The default behavior doesn’t use caching. DataFrames are recomputed on each action.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;endpoints_df = generate_endpoints().cache()&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Use this optimization to cache DataFrames that are accessed multiple times in your application. Monitor memory usage and use MEMORY_AND_DISK storage level for large DataFrames. Uncache DataFrames when they’re no longer needed to free memory.&lt;/p&gt; 
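&lt;p&gt;One way to make sure a cached DataFrame is released is to scope the cache with a context manager. The following is a minimal sketch: &lt;code&gt;cached&lt;/code&gt; is a hypothetical helper name, not part of Spark, and it relies only on the DataFrame &lt;code&gt;cache()&lt;/code&gt; and &lt;code&gt;unpersist()&lt;/code&gt; methods.&lt;/p&gt;

```python
from contextlib import contextmanager


@contextmanager
def cached(df):
    """Cache a DataFrame for the duration of a block, then release it."""
    cached_df = df.cache()
    try:
        yield cached_df
    finally:
        # Runs even if the block raises, so executor memory is freed.
        cached_df.unpersist()


# Usage inside a Spark job (generate_endpoints() is the function shown above):
# with cached(generate_endpoints()) as endpoints_df:
#     endpoints_df.count()  # the first action materializes the cache
#     ...                   # later actions reuse the cached data
```

&lt;p&gt;Because caching is lazy, the data is only stored when the first action runs inside the block.&lt;/p&gt;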
&lt;p&gt;This optimization is most effective in the following scenarios:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;When reusing reference data across multiple operations (can result in performance gains).&lt;/li&gt; 
 &lt;li&gt;For workloads where the same data is processed on multiple occasions.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Consider the following trade-offs:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Too much caching might lead to memory pressure.&lt;/li&gt; 
 &lt;li&gt;Planning is required to manage cache in environments where memory is scarce.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Optimal partitioning&lt;/h3&gt; 
&lt;p&gt;By default, Spark determines partitioning based on input data and previous operations. The following code makes sure data is properly distributed across executors:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;.repartition(20)&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Use &lt;code&gt;repartition()&lt;/code&gt; when you need to increase the number of partitions for better parallelism or to distribute data evenly. Use &lt;code&gt;coalesce()&lt;/code&gt; when reducing partitions to avoid small files. Generally, target 100–200 MB per partition for optimal performance.&lt;/p&gt; 
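&lt;p&gt;The choice between the two calls can be sketched as a small helper. &lt;code&gt;rebalance&lt;/code&gt; is a hypothetical name introduced for illustration; it uses only the DataFrame methods &lt;code&gt;repartition()&lt;/code&gt;, &lt;code&gt;coalesce()&lt;/code&gt;, and &lt;code&gt;rdd.getNumPartitions()&lt;/code&gt;.&lt;/p&gt;

```python
def rebalance(df, target_partitions: int):
    """Pick repartition() or coalesce() based on the direction of the change."""
    current = df.rdd.getNumPartitions()
    if target_partitions == current:
        return df
    if target_partitions > current:
        # Growing the partition count requires a full shuffle.
        return df.repartition(target_partitions)
    # Shrinking merges existing partitions and avoids a full shuffle.
    return df.coalesce(target_partitions)
```

&lt;p&gt;Note that &lt;code&gt;coalesce()&lt;/code&gt; merges partitions as they are and doesn’t even out skew. If balanced partitions matter more than avoiding the shuffle, call &lt;code&gt;repartition()&lt;/code&gt; directly, even when reducing the count.&lt;/p&gt;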
&lt;p&gt;This optimization is most effective in the following scenarios:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;When controlling data distribution and avoiding data skew is very important.&lt;/li&gt; 
 &lt;li&gt;Before executing an expensive operation that will benefit from balanced data distribution.&lt;/li&gt; 
 &lt;li&gt;When optimizing downstream consumption use cases.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Consider the following trade-offs:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;code&gt;repartition()&lt;/code&gt; is more expensive than &lt;code&gt;coalesce()&lt;/code&gt;. For large datasets, &lt;code&gt;repartition()&lt;/code&gt; can trigger a large shuffle.&lt;/li&gt; 
 &lt;li&gt;The approach requires trial and experimentation to determine the optimal partition count.&lt;/li&gt; 
 &lt;li&gt;There is no “one-size-fits-all” setting. Different applications or operations might gain performance with different partitioning.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Partition-aware writing&lt;/h3&gt; 
&lt;p&gt;By default, data is written without partitioning. The following code organizes data for efficient storage and retrieval:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-python"&gt;{"df": network_events_df, "name": "network_events", "partition_cols": ["tenant_id"]}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Partition data when you have predictable query patterns that filter on specific columns. Choose partition columns that are frequently used in WHERE clauses and have reasonable cardinality (avoid too many small partitions or too few large ones).&lt;/p&gt; 
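&lt;p&gt;A dataset specification like the one above could drive a generic write loop. The following is a minimal sketch under stated assumptions: &lt;code&gt;write_datasets&lt;/code&gt;, the Parquet output format, the overwrite mode, and the base path are illustrative choices, not the authors’ implementation. It uses the standard DataFrameWriter methods &lt;code&gt;mode()&lt;/code&gt;, &lt;code&gt;format()&lt;/code&gt;, &lt;code&gt;partitionBy()&lt;/code&gt;, and &lt;code&gt;save()&lt;/code&gt;.&lt;/p&gt;

```python
def write_datasets(datasets, base_path, file_format="parquet"):
    """Write each dataset, partitioned by its configured columns."""
    written = []
    for spec in datasets:
        writer = spec["df"].write.mode("overwrite").format(file_format)
        partition_cols = spec.get("partition_cols", [])
        if partition_cols:
            # Produces one directory per distinct value, such as tenant_id=42/.
            writer = writer.partitionBy(*partition_cols)
        target = f"{base_path}/{spec['name']}"
        writer.save(target)
        written.append(target)
    return written


# Usage inside a Spark job (the bucket name is a placeholder):
# write_datasets(
#     [{"df": network_events_df, "name": "network_events", "partition_cols": ["tenant_id"]}],
#     "s3://amzn-s3-demo-bucket/synthetic",
# )
```

&lt;p&gt;Datasets without a &lt;code&gt;partition_cols&lt;/code&gt; entry are written unpartitioned, which matches the default behavior described earlier.&lt;/p&gt;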
&lt;p&gt;This optimization offers the following benefits:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Allows for highly parallel write operations across multiple executors.&lt;/li&gt; 
 &lt;li&gt;Organizes the data in a layout that resembles real-world production data.&lt;/li&gt; 
 &lt;li&gt;Allows for partition pruning when querying the data.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Consider the following trade-offs:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Excess partitioning or too fine-grained partitioning might result in small files.&lt;/li&gt; 
 &lt;li&gt;It might result in data skew because of hot partitions.&lt;/li&gt; 
 &lt;li&gt;You might encounter storage and metadata overhead because of excessive partitions.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Best practices&lt;/h2&gt; 
&lt;p&gt;Through our journey from terabytes to petabytes, we’ve identified several best practices:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Begin with a modest dataset and incrementally scale, allowing for identification of bottlenecks at each stage.&lt;/li&gt; 
 &lt;li&gt;Implement robust data validation checks to confirm synthetic data maintains expected properties at scale.&lt;/li&gt; 
 &lt;li&gt;Regularly review and adjust Amazon EMR configurations, using Spot Instances and right-sizing clusters.&lt;/li&gt; 
 &lt;li&gt;Develop parameterized job scripts that can adjust data volume, complexity, and cluster resources dynamically.&lt;/li&gt; 
 &lt;li&gt;Design your synthetic data schema and generation logic to quickly accommodate new fields or changing distributions over time.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;Our journey from terabytes to petabytes of synthetic data generation demonstrates how Amazon EMR, combined with Spark and Faker, can effectively address large-scale testing needs. The architecture we explored in this post scales to meet demanding data generation requirements while maintaining data quality and cost-efficiency.&lt;/p&gt; 
&lt;p&gt;We showed how starting with a solid foundation at terabyte scale, then gradually expanding through Amazon EMR managed services and Spot Instances, helps organizations build robust synthetic data pipelines. The combination of efficient data generation techniques, proper validation, and continuous monitoring provides reliable results at scale.&lt;/p&gt; 
&lt;p&gt;To begin implementing your own synthetic data generation system, start small, test thoroughly, and scale incrementally. For implementation guidance, refer to &lt;a href="https://repost.aws/articles/ARUUyEmZiKSSm2XJq2p_63HA/generate-production-grade-synthetic-data-at-petabyte-scale-using-apache-spark-and-faker-on-amazon-emr" target="_blank" rel="noopener"&gt;Generate production-grade synthetic data at petabyte-scale using Apache Spark and Faker on Amazon EMR&lt;/a&gt;.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h3&gt;About the authors&lt;/h3&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/09/30/pic_headshot.jpeg" alt="Anubhav Awasthi" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Anubhav Awasthi&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/anubhavawasthi/" target="_blank" rel="noopener"&gt;Anubhav&lt;/a&gt; is a Senior Big Data Specialist Solutions Architect at Amazon Web Services (AWS). He collaborates with customers to provide expert architectural guidance for implementing and optimizing analytics solutions using Amazon EMR, Amazon Athena, AWS Glue, and AWS Lake Formation.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/09/30/brahmi-badge-high-resolution-100x133.jpeg" alt="Gagan Brahmi" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Gagan Brahmi&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/gaganbrahmi/" target="_blank" rel="noopener"&gt;Gagan&lt;/a&gt; is a Specialist Senior Solutions Architect at Amazon Web Services (AWS), focused on Data Analytics and AI/ML. With over 20 years in information technology, he partners with customers to solve complex AI/ML challenges by leveraging data and AI/ML platforms. Gagan helps customers architect scalable, high-performance solutions that utilize distributed data processing, real-time streaming technologies, and AI/ML services to drive business transformation through artificial intelligence and data-driven insights. When not designing cloud-native data and AI solutions, Gagan enjoys exploring new places with his family.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/09/30/JP-Profile-Pic-1-100x98.jpg" alt="Jayaprakash Boreddy" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Jayaprakash Boreddy&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/jayaprakash-boreddy/" target="_blank" rel="noopener"&gt;Jayaprakash&lt;/a&gt; is a Senior Solutions Architect at AWS. He works with ISV customers in designing and building highly scalable, flexible and resilient applications on AWS Cloud.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/09/30/Picture1-39-100x131.png" alt="Sahil Thapar" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Sahil Thapar&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/sahil-thapar-71657b77/" target="_blank" rel="noopener"&gt;Sahil&lt;/a&gt; is a Principal Solutions Architect. He works with ISV customers to help them build highly available, scalable, and resilient applications on the AWS Cloud.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>Meet Amazon Redshift RG – AWS Graviton-based instances with an integrated data lake query engine delivering up to 2.4x better performance at 30% lower price than RA3</title>
		<link>https://aws.amazon.com/blogs/big-data/meet-amazon-redshift-rg-aws-graviton-based-instances-with-an-integrated-data-lake-query-engine-delivering-up-to-2-4x-better-performance-at-30-lower-price-than-ra3/</link>
		
		<dc:creator><![CDATA[Ankit Sahu]]></dc:creator>
		<pubDate>Tue, 19 May 2026 15:38:08 +0000</pubDate>
				<category><![CDATA[Amazon Redshift]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Graviton]]></category>
		<guid isPermaLink="false">ca762ef4bab0d39a260fedc22d1cb40f1e58c785</guid>

					<description>On May 12, 2026, we announced the general availability of Amazon Redshift RG instances, powered by AWS Graviton processors. RG instances are up to 2.2x as fast for data warehouse workloads and up to 2.4x as fast for data lake workloads, all at 30% lower price per vCPU compared to RA3 instances. RG instances support all data lake formats supported by RA3 and eliminate Amazon Redshift Spectrum’s per-TB scanning charges. RG instances feature a custom-built integrated vectorized query engine, making them a more performant and cost-effective foundation for unified analytics. We are launching with two instance sizes: rg.xlarge and rg.4xlarge, with additional sizes coming later this year.</description>
										<content:encoded>&lt;p&gt;On May 12, 2026, we announced the general availability of &lt;a href="https://aws.amazon.com/redshift/features/rg" target="_blank" rel="noopener"&gt;Amazon Redshift RG instances&lt;/a&gt;, powered by AWS Graviton processors. RG instances are up to 2.2x as fast for data warehouse workloads and up to 2.4x as fast for data lake workloads, all at 30% lower price per vCPU compared to RA3 instances. RG instances support all data lake formats supported by RA3 and eliminate Amazon Redshift Spectrum’s per-TB scanning charges. RG instances feature a custom-built integrated vectorized query engine, making them a more performant and cost-effective foundation for unified analytics.&lt;/p&gt; 
&lt;p&gt;We are launching with two instance sizes: &lt;strong&gt;rg.xlarge&lt;/strong&gt; and &lt;strong&gt;rg.4xlarge&lt;/strong&gt;, with additional sizes coming later this year.&lt;/p&gt; 
&lt;h2&gt;Why we built this&lt;/h2&gt; 
&lt;p&gt;RG instances bring the power of AWS Graviton processors to Amazon Redshift Provisioned clusters for the first time, paired with a purpose-built vectorized query engine. By combining Graviton’s superior price-performance with the latest Amazon Redshift innovations, RG instances deliver a step-change improvement across two dimensions: significantly lower cost and meaningfully faster performance for both warehouse and data lake workloads using Apache Iceberg and Apache Parquet. We built RG to help you avoid choosing between performance and economics. Graviton costs less to operate, and we’re passing that benefit to you while simultaneously raising the performance bar. Equally important, we designed RG to maintain full feature parity with RA3, so you can modernize your existing clusters without rearchitecting workloads or sacrificing capabilities you depend on today.&lt;/p&gt; 
&lt;p&gt;This combination is also increasingly critical for agentic artificial intelligence (AI) workloads. AI agents operating at scale generate a new class of analytics demand: high volumes of unique, unpredictable queries that require fast, low-latency responses to keep agents productive. Traditional price-performance ratios make running these workloads at scale cost-prohibitive. RG instances address this head-on. Lower per-vCPU pricing makes sustained high-query volumes economically viable, while improved query performance makes sure agents get answers fast enough to remain effective. Together, this provides the foundation for AI-driven analytics at the scale and economics that agentic workloads demand.&lt;/p&gt; 
&lt;h2&gt;What’s new&lt;/h2&gt; 
&lt;h3&gt;RG instances: Better performance, lower cost&lt;/h3&gt; 
&lt;p&gt;RG instances run on AWS Graviton, Amazon’s custom-designed cloud processor built from the ground up to deliver superior price-performance and energy efficiency. This translates directly into RG instances offering more compute cores, higher memory bandwidth, and lower inter-process communication latency compared to RA3, with performance improvements across warehouse, data lake, and mixed workloads.&lt;/p&gt; 
&lt;p&gt;Graviton costs less to operate, and we’re passing that benefit directly to you. RG instances are priced at a 30% lower cost per vCPU compared to RA3. Reserved Instance pricing follows the same model, so RG Reserved Instances are also 30% less costly than RA3. For pricing details, visit the &lt;a href="https://aws.amazon.com/redshift/pricing/" target="_blank" rel="noopener"&gt;Amazon Redshift pricing page&lt;/a&gt;.&lt;/p&gt; 
&lt;h3&gt;Performance results&lt;/h3&gt; 
&lt;p&gt;RG instances deliver faster, more efficient analytics across your most demanding warehouse and data lake workloads, whether you’re querying structured data in Amazon Redshift Managed Storage (RMS), running analytics over Iceberg tables in Amazon Simple Storage Service (Amazon S3), or processing Parquet files at scale. Iceberg workloads see the most significant gains, delivering up to &lt;strong&gt;2.4x faster&lt;/strong&gt; query execution. Parquet workloads deliver up to &lt;strong&gt;1.5x faster&lt;/strong&gt; query execution, and RMS-based data warehouse workloads deliver up to &lt;strong&gt;2.2x faster&lt;/strong&gt; query execution. All performance improvements are measured using industry-standard TPC-DS and TPC-H benchmarks at 10 TB scale on rg.4xlarge instances.&lt;/p&gt; 
&lt;p&gt;When combined with RG’s 30% lower per-vCPU pricing compared to RA3, these performance gains translate to even greater price-performance improvements, delivering more analytics value for every dollar spent.&lt;/p&gt; 
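&lt;p&gt;As a back-of-the-envelope illustration of how speed and price combine, the following sketch divides each “up to” speedup by the relative price. It assumes an RG cluster with the same vCPU count as the RA3 cluster it replaces and a workload that achieves the full quoted speedup, so treat the output as an upper bound rather than a measured result. The function name is hypothetical.&lt;/p&gt;

```python
def price_performance_gain(speedup: float, price_reduction: float) -> float:
    """Work completed per dollar relative to the baseline (1.0 means no change)."""
    return speedup / (1.0 - price_reduction)


# "Up to" speedups quoted in this post, combined with the 30% lower per-vCPU price.
quoted_speedups = {"Iceberg": 2.4, "RMS warehouse": 2.2, "Parquet": 1.5}
for workload, speedup in quoted_speedups.items():
    gain = price_performance_gain(speedup, 0.30)
    print(f"{workload}: up to {gain:.2f}x work per dollar")
```

&lt;p&gt;Actual gains depend on how you right-size the cluster during migration, so validate with your own workloads.&lt;/p&gt;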
&lt;h3&gt;Built-in data lake query engine – no more Spectrum charges&lt;/h3&gt; 
&lt;p&gt;With RA3, data lake queries were offloaded to a separate fleet of nodes called Amazon Redshift Spectrum, scanning data externally and returning results back to the cluster. This architecture introduced network overhead, added latency, and imposed a $5/TB scanning charge on every query. RG instances change this fundamentally with a custom-built vectorized data lake engine running directly inside the cluster, eliminating Spectrum scanning charges.&lt;/p&gt; 
&lt;p&gt;The purpose-built vectorized engine includes a highly optimized scan layer that implements the latest data pruning techniques, a purpose-built I/O subsystem, and a range of optimizations that use Graviton’s processing capabilities to make scanning Iceberg and Parquet data highly efficient. Beyond raw scan performance, the engine introduces JIT ANALYZE, a capability that automatically collects and uses statistics for data lake tables during query execution. This eliminates the need for manual statistics collection. The system uses intelligent heuristics to identify queries that will benefit from statistics, maintains lightweight sketch data structures, and builds high-quality table-level and column-level statistics, all transparently. Having up-to-date statistics on data lake tables can deliver orders-of-magnitude improvements in query performance, and with JIT ANALYZE, you get this benefit automatically without operational overhead.&lt;/p&gt; 
&lt;h2&gt;What customers are saying&lt;/h2&gt; 
&lt;p&gt;&lt;strong&gt;Sean Lynch&lt;/strong&gt;, Vice President, Data and Architecture, Southwest Airlines:&lt;/p&gt; 
&lt;blockquote&gt;
 &lt;p&gt;“Amazon Redshift RG instances have the potential to deliver meaningful business impact for Southwest Airlines. Based on initial testing in our development environment, our data warehouse workloads run 50-60% faster, and data lake analytics are 45% faster, enabling teams to get insights sooner, respond to operational conditions faster, and make data-driven decisions with less latency. These early results are encouraging, and we are excited to validate and scale these improvements in production. All of this comes without per-terabyte Spectrum scanning charges, delivering 30% lower cost than RA3 at a time when fuel prices continue to pressure industry margins.”&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;p&gt;&lt;strong&gt;Akshay Srinivasan&lt;/strong&gt;, Data Engineer, tombola:&lt;/p&gt; 
&lt;blockquote&gt;
 &lt;p&gt;“The new Graviton-based Amazon Redshift RG instances delivered 1.8x-2x faster write throughput and up to 2.2x faster read speeds compared to RA3 across a diverse set of batch and analytical jobs, enabling us to process 40% more within the same window. Compressed ETL cycles, accelerated time-to-insight, and decision-making no longer bottlenecked by the pipeline. Together, these translated directly into fresher data reaching our analysts and business teams sooner. What made this even more compelling was a concurrent 30% reduction in compute spend alongside the gains. Delivering more for less is a rare outcome, and one worth highlighting. In a volume-heavy gaming industry at tombola, where query latency and cost compound at scale, this has been one of the more impactful platform decisions we’ve made this year.”&lt;/p&gt;
&lt;/blockquote&gt; 
&lt;h2&gt;Modernizing your workloads to RG&lt;/h2&gt; 
&lt;p&gt;Today, we are launching rg.xlarge and rg.4xlarge instance sizes, available now for you to modernize your existing Amazon Redshift provisioned workloads. RG instances support three migration paths, all accessible directly from the AWS Management Console:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Elastic Resize&lt;/strong&gt; (recommended): The fastest path for most customers migrating from RA3 or DC2, with only 10-15 minutes of downtime.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Snapshot &amp;amp; Restore&lt;/strong&gt;: Best if you need to make configuration changes as part of your migration.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Classic Resize&lt;/strong&gt;: Available for workloads that require a full cluster rebuild.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Before migrating your production workloads, we strongly recommend validating your queries and workloads on RG instances first. We’ve published an &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/managing-cluster-considerations.html" target="_blank" rel="noopener"&gt;Upgrade Guide&lt;/a&gt; to help you right-size your cluster and plan your migration with confidence.&lt;/p&gt; 
&lt;h2&gt;Getting started&lt;/h2&gt; 
&lt;p&gt;You can start using the RG instances (rg.xlarge and rg.4xlarge) today in the following AWS Regions: US East (N. Virginia), US East (Ohio), US West (Oregon), US West (N. California), Canada (Central), South America (São Paulo), Europe (Ireland), Europe (Frankfurt), Europe (London), Europe (Paris), Europe (Stockholm), Europe (Milan), Europe (Spain), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Mumbai), Asia Pacific (Jakarta), Asia Pacific (Hong Kong), Asia Pacific (Osaka), Asia Pacific (Malaysia), Asia Pacific (Hyderabad), Asia Pacific (Taipei), and Asia Pacific (Melbourne).&lt;/p&gt; 
&lt;p&gt;You can launch new clusters or migrate existing clusters through the &lt;a href="https://console.aws.amazon.com/redshift" target="_blank" rel="noopener"&gt;AWS Management Console&lt;/a&gt;, &lt;a href="https://aws.amazon.com/cli" target="_blank" rel="noopener"&gt;AWS Command Line Interface (AWS CLI)&lt;/a&gt;, or AWS API.&lt;/p&gt; 
&lt;h3&gt;To create a new RG cluster in the Amazon Redshift console&lt;/h3&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;Review the &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html#rs-about-clusters-and-nodes" target="_blank" rel="noopener"&gt;Cluster and Nodes&lt;/a&gt; in the Amazon Redshift documentation.&lt;/li&gt; 
 &lt;li&gt;Choose Amazon Redshift on the AWS Management Console and choose &lt;strong&gt;Create Cluster&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;In the &lt;strong&gt;Create Cluster&lt;/strong&gt; screen, choose the required RG node type.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/14/BDB-5935-1.png" alt="Amazon Redshift console showing the Create Cluster screen with RG node type selection" width="600"&gt;&lt;/p&gt; 
&lt;h3&gt;To modernize from RA3 or DC2 in the Amazon Redshift console&lt;/h3&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;Review the &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/managing-cluster-considerations.html" target="_blank" rel="noopener"&gt;Upgrade Guide&lt;/a&gt; in the Amazon Redshift documentation.&lt;/li&gt; 
 &lt;li&gt;Choose your migration path. Elastic Resize is the right starting point for most customers.&lt;/li&gt; 
 &lt;li&gt;Choose the required RG node type.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/14/BDB-5935-2.png" alt="Amazon Redshift console showing the Elastic Resize option for migrating to RG instances" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/14/BDB-5935-3.png" alt="Amazon Redshift console showing the node type selection during resize" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/14/BDB-5935-4.png" alt="Amazon Redshift console showing the resize confirmation screen" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;For pricing details, visit the &lt;a href="https://aws.amazon.com/redshift/pricing/" target="_blank" rel="noopener"&gt;Amazon Redshift pricing page&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;Clean up&lt;/h2&gt; 
&lt;p&gt;If you are evaluating RG instances in a test or development environment and do not wish to continue, you can delete your RG cluster directly from the AWS Management Console or by using the AWS CLI to avoid incurring additional charges. If you used Snapshot &amp;amp; Restore to create a test RG cluster alongside your existing RA3 cluster, make sure you delete the RG cluster and any associated snapshots you no longer need. If you are using Data Sharing during migration, remember to remove data shares and decommission your RA3 cluster after you have fully validated your workloads on RG.&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;Amazon Redshift RG instances represent a significant step forward for anyone running data warehouse and data lake workloads on AWS. By bringing AWS Graviton processors to Amazon Redshift Provisioned clusters for the first time, paired with a purpose-built vectorized native data lake engine, RG instances deliver up to 2.4x better performance on Iceberg workloads, up to 1.5x on Parquet, and up to 2.2x on RMS data warehouse workloads, all at 30% lower per-vCPU cost than RA3. The elimination of Amazon Redshift Spectrum scanning charges makes data lake query costs predictable for the first time.&lt;/p&gt; 
&lt;p&gt;To get started with RG instances, visit the &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html" target="_blank" rel="noopener"&gt;Amazon Redshift RG documentation&lt;/a&gt; to assess your workload and plan your migration.&lt;/p&gt; 
&lt;h2&gt;Resources&lt;/h2&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/working-with-clusters.html" target="_blank" rel="noopener"&gt;Amazon Redshift RG Instance Documentation&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/managing-cluster-considerations.html" target="_blank" rel="noopener"&gt;Upgrade Guide&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/redshift/pricing/" target="_blank" rel="noopener"&gt;Amazon Redshift Pricing&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://repost.aws/tags/questions/TAByF7MpfSQUCX_lAeDTvODw" target="_blank" rel="noopener"&gt;AWS re:Post – Amazon Redshift Community&lt;/a&gt;.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;em&gt;Questions or feedback? Drop a comment or join the discussion on &lt;a href="https://repost.aws/" target="_blank" rel="noopener"&gt;AWS re:Post&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-91371" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/18/ankit.png" alt="" width="512" height="512"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Ankit Sahu&lt;/h3&gt; 
  &lt;p&gt;Ankit Sahu brings over 18 years of expertise in building innovative data products and services. His diverse experience spans product strategy, go-to-market execution, and digital transformation initiatives. Currently, as Sr. Product Manager at Amazon Web Services (AWS), Ankit is driving the vision and strategy for Amazon Redshift.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>OpenSearch Agent Skills bring built-in intelligence to your agentic IDE</title>
		<link>https://aws.amazon.com/blogs/big-data/opensearch-agent-skills-bring-built-in-intelligence-to-your-agentic-ide/</link>
		
		<dc:creator><![CDATA[Bobby Mohammed]]></dc:creator>
		<pubDate>Mon, 18 May 2026 19:15:11 +0000</pubDate>
				<category><![CDATA[Amazon OpenSearch Service]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Launch]]></category>
		<guid isPermaLink="false">708fa9a3c19c5defcfa6b21dc05491836851d1c1</guid>

					<description>Today, we’re launching OpenSearch Agent Skills, a repository of open, composable skills that bring built-in intelligence to developer workflows with OpenSearch, directly inside your favorite agentic IDE. By embedding OpenSearch expertise into the developer’s existing workflow, Agent Skills reduce setup time, eliminate unnecessary tool-hopping, and let teams focus on building rather than configuring.</description>
										<content:encoded>&lt;p&gt;Today, we’re launching OpenSearch Agent Skills, a repository of open, composable skills that bring built-in intelligence to developer workflows with OpenSearch, directly inside your favorite agentic IDE. By embedding OpenSearch expertise into the developer’s existing workflow, Agent Skills reduce setup time, eliminate unnecessary tool-hopping, and let teams focus on building rather than configuring.&lt;/p&gt; 
&lt;p&gt;Developers today can go from idea to working prototype in minutes using agentic IDEs like Claude, Cursor, and Kiro. They can spin up applications, generate APIs, and build end-to-end workflows with a prompt. But whether you’re experimenting with a new idea, building a POC, or running production systems, the experience quickly becomes more complex. For example, improving relevance in OpenSearch still requires deep expertise in query Domain-Specific Language (DSL), ranking logic, and hybrid search tuning. Troubleshooting latency or cluster health issues often means manually piecing together signals from logs, traces, shards, and infrastructure metrics. Even migrations from Elasticsearch or Solr can become complex and time-consuming because of schema conversion, compatibility gaps, and performance optimization challenges. As AI agents become a primary interface for building and operating applications on OpenSearch, a deeper gap emerges. Translating high-level intent into query DSLs, index configurations, and multi-step workflows still requires significant expertise. At the same time, workflows remain fragmented across domains like search, logs, and observability, forcing teams into siloed tooling and disconnected reasoning. The result is repeated trial-and-error, lack of standardized approaches, and slower time-to-value, despite the promise of faster development.&lt;/p&gt; 
&lt;h2&gt;What are Agent Skills?&lt;/h2&gt; 
&lt;p&gt;&lt;a href="https://agentskills.io/home" target="_blank" rel="noopener"&gt;Agent Skills&lt;/a&gt;, developed by Anthropic, are a lightweight, open format for extending AI agent capabilities with specialized knowledge and workflows. They’re supported by a growing number of AI tools and agentic clients, including Kiro, Claude Code, Cursor, VS Code, GitHub Copilot, Codex, and others.&lt;/p&gt; 
&lt;p&gt;At their core, Agent Skills are pre-built intelligence you can call, extend, and reuse. Each skill encapsulates domain knowledge, execution logic with multi-step workflows, and guidance with explainability, so you not only get results but understand how they’re achieved. Instead of stitching together tools and writing custom logic, you can invoke a skill to handle an entire task, from analysis to recommendation to execution.&lt;/p&gt; 
&lt;p&gt;At launch, OpenSearch Agent Skills introduces three foundational skills designed to address some of the most common and complex developer workflows: &lt;strong&gt;Search, Logs, and Solr to OpenSearch Migrations&lt;/strong&gt;.&lt;/p&gt; 
&lt;h2&gt;Search skill&lt;/h2&gt; 
&lt;p&gt;The Search Skill builds on the foundation introduced by &lt;a href="https://opensearch.org/blog/introducing-opensearch-launchpad-from-requirements-to-a-running-search-application-in-minutes/" target="_blank" rel="noopener"&gt;OpenSearch Launchpad&lt;/a&gt; and brings an agentic, intent-driven experience to building and optimizing search applications with OpenSearch. Developers can go from a simple requirement or sample document to a fully working search application in minutes, whether lexical, semantic, hybrid, or agentic, with no deep OpenSearch expertise required.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Translates natural language requirements or sample data into search configurations.&lt;/li&gt; 
 &lt;li&gt;Automatically creates index mappings, ingest pipelines, and ML model integrations.&lt;/li&gt; 
 &lt;li&gt;Sets up keyword, semantic, and hybrid search capabilities out of the box.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;Build a semantic search application for product documentation&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Fully configured OpenSearch index with optimized mappings.&lt;/li&gt; 
 &lt;li&gt;Integrated embedding models and ingest pipeline.&lt;/li&gt; 
 &lt;li&gt;Working search experience (API + UI) ready to test and iterate.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;By extending Launchpad’s capabilities into an agent-native workflow, the Search Skill lets you move from idea to a production-ready search application in minutes, eliminating manual setup and accelerating both prototyping and deployment in OpenSearch.&lt;/p&gt; 
&lt;h2&gt;Logs skill&lt;/h2&gt; 
&lt;p&gt;The Logs Skill analyzes log data and investigates distributed traces directly within OpenSearch, bringing agentic intelligence to observability workflows. Instead of manually crafting PPL queries or piecing together trace data across services, developers can express their intent and let the skill handle the complexity.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Queries and analyzes log data using PPL, including error patterns, log volume trends, and anomaly detection.&lt;/li&gt; 
 &lt;li&gt;Investigates distributed traces, identifying slow spans, error spans, service dependencies, and agent invocations.&lt;/li&gt; 
 &lt;li&gt;Correlates logs and traces using traceId to surface root causes across the full observability stack.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;Investigate why my service is returning 500s and correlate with recent traces&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;PPL query results surfacing error patterns and log volume anomalies.&lt;/li&gt; 
 &lt;li&gt;Trace analysis identifying slow or failing spans and service dependencies.&lt;/li&gt; 
 &lt;li&gt;Correlated view linking log errors to specific trace IDs for faster root cause analysis.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;With the Logs Skill, you can move from a vague symptom to a pinpointed root cause in minutes without needing to master PPL syntax or manually navigate trace data.&lt;/p&gt; 
&lt;h2&gt;Solr to OpenSearch migration skill&lt;/h2&gt; 
&lt;p&gt;The Migration Skill streamlines the complex process of migrating from Solr to OpenSearch. Migrations typically involve cluster discovery, compatibility checks, schema translation, data movement, and validation. These steps often require deep expertise and manual coordination. The Migration Skill turns all these steps into a guided, automated workflow.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;What it does:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Discovers and analyzes source clusters, including indices, mappings, and configurations.&lt;/li&gt; 
 &lt;li&gt;Performs compatibility assessment and highlights breaking changes or required transformations.&lt;/li&gt; 
 &lt;li&gt;Translates schemas, index settings, and queries into OpenSearch-compatible formats.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;How can I migrate from Solr to OpenSearch?&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Output:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Detailed migration plan with compatibility report and required changes.&lt;/li&gt; 
 &lt;li&gt;Translated index mappings and configurations ready for OpenSearch.&lt;/li&gt; 
 &lt;li&gt;Executed data migration pipeline with progress tracking.&lt;/li&gt; 
 &lt;li&gt;Validation report confirming data integrity and query parity between source and target.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;With the Migration Skill, developers can move from a fragmented, high-risk migration process to a structured, automated workflow. This approach provides faster transitions, reduced downtime, and confidence in production readiness.&lt;/p&gt; 
&lt;h2&gt;How it works&lt;/h2&gt; 
&lt;p&gt;OpenSearch Agent Skills are organized as a tree of SKILL.md files, structured by domain category. Rather than one monolithic skill that loads everything, the repo is broken into focused, independently installable skills. Each skill is small enough to stay within a tight context window, but complete enough to handle real end-to-end workflows.&lt;/p&gt; 
&lt;p&gt;The top-level structure currently groups skills into three categories:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Search:&lt;/strong&gt; opensearch-launchpad for building BM25, semantic, and hybrid search applications from scratch.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Observability:&lt;/strong&gt; log-analytics for PPL-based log querying and error analysis, and trace-analytics for distributed trace investigation and span analysis.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Cloud:&lt;/strong&gt; aws-setup for deploying to Amazon OpenSearch Service (managed) or Amazon OpenSearch Serverless, with separate manifests for each.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Each skill bundles everything the agent needs: step-by-step workflows, reference docs (like PPL syntax guides and CLI references), and executable scripts that run directly against your cluster.&lt;/p&gt; 
&lt;p&gt;When you say &lt;em&gt;“build a hybrid search app”&lt;/em&gt; or &lt;em&gt;“why is my service throwing 500 errors?”&lt;/em&gt;, the agent activates only the matching skill, follows its instructions, and executes the right OpenSearch APIs. It returns results alongside clear explanations of what was configured and why. Because skills load on demand, you can have the full collection installed without bloating your agent’s context window.&lt;/p&gt; 
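&lt;p&gt;The on-demand loading described above can be pictured with a small Python sketch. This is only an illustration under stated assumptions, not how any particular agent is implemented: the &lt;code&gt;SKILLS&lt;/code&gt; dictionary, the placeholder bodies, and the &lt;code&gt;context_for&lt;/code&gt; helper are hypothetical, and the way a real agent decides which skill to activate is not modeled here. The skill names and descriptions are taken from the category list above.&lt;/p&gt; 

```python
# Hypothetical illustration of on-demand skill loading; not a real agent's logic.
# Skill names and descriptions come from the post; the bodies are placeholders.
SKILLS = {
    "opensearch-launchpad": {
        "description": "build BM25, semantic, and hybrid search applications",
        "body": "[launchpad workflow, reference docs, and scripts]",
    },
    "log-analytics": {
        "description": "PPL-based log querying and error analysis",
        "body": "[log analytics workflow, PPL syntax guide, and scripts]",
    },
    "trace-analytics": {
        "description": "distributed trace investigation and span analysis",
        "body": "[trace analytics workflow, reference docs, and scripts]",
    },
}


def context_for(active_skill=None):
    """Return the text kept in context: every short description, plus at most one full body."""
    lines = [f"{name}: {skill['description']}" for name, skill in SKILLS.items()]
    if active_skill is not None:
        # Only the activated skill contributes its full instructions.
        lines.append(SKILLS[active_skill]["body"])
    return "\n".join(lines)


idle_context = context_for()
active_context = context_for("log-analytics")
print(len(idle_context), len(active_context))
```

&lt;p&gt;In this sketch, installing more skills grows the context only by one short description each, while the full instructions are added for the single skill that is active.&lt;/p&gt; 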
&lt;p&gt;We’re continuously expanding the skill library. Categories like Dashboard and Migration are already on the roadmap, with more to come as the ecosystem grows.&lt;/p&gt; 
&lt;h2&gt;Getting started&lt;/h2&gt; 
&lt;p&gt;Getting started with OpenSearch Agent Skills is straightforward. No MCP server or extras are required. Skills are installed using &lt;code&gt;npx skills&lt;/code&gt; and work directly with your existing agentic IDE.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Python 3.11+ and uv.&lt;/li&gt; 
 &lt;li&gt;Docker installed and running.&lt;/li&gt; 
 &lt;li&gt;AWS credentials configured (optional, for cloud deployment).&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Install all skills:&lt;/strong&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;npx skills add opensearch-project/opensearch-agent-skills&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Or install specific skills individually (for example, opensearch-launchpad):&lt;/strong&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="language-bash"&gt;npx skills add opensearch-project/opensearch-agent-skills@opensearch-launchpad --full-depth

npx skills add opensearch-project/opensearch-agent-skills@log-analytics --full-depth

npx skills add opensearch-project/opensearch-agent-skills@trace-analytics --full-depth

npx skills add opensearch-project/opensearch-agent-skills@migration-companion --full-depth&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Once installed, simply express your intent to your agent, for example, &lt;em&gt;“I want to build a semantic search app with OpenSearch,”&lt;/em&gt; and the agent reads the skill instructions and runs the scripts automatically.&lt;/p&gt; 
&lt;p&gt;Skills can also be installed to a specific agent (&lt;code&gt;-a claude-code&lt;/code&gt;), globally across all projects (&lt;code&gt;-g&lt;/code&gt;), or to all detected agents (&lt;code&gt;--all&lt;/code&gt;). Explore available skills before installing with &lt;code&gt;--list&lt;/code&gt;.&lt;/p&gt; 
&lt;h2&gt;Looking ahead&lt;/h2&gt; 
&lt;p&gt;This is just the beginning. We’re actively expanding the OpenSearch Agent Skills ecosystem with new capabilities across advanced relevance tuning, cost-aware performance optimization, index lifecycle and schema evolution, and cross-domain workflows that unify search, logs, and analytics.&lt;/p&gt; 
&lt;p&gt;Over time, we see Agent Skills becoming a community-driven knowledge layer across OpenSearch domains where solving a complex problem once means everyone benefits. More importantly, Agent Skills mark a fundamental shift in how developers build and operate with OpenSearch: moving away from manual, fragmented workflows toward intelligent, reusable capabilities that guide, optimize, and accelerate development at every stage.&lt;/p&gt; 
&lt;h2&gt;Get involved&lt;/h2&gt; 
&lt;p&gt;OpenSearch Agent Skills is designed to be an open, evolving ecosystem, and we’re just getting started. Here’s how you can participate:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Try it in your workflow.&lt;/strong&gt; Install the skills in Claude, Cursor, or Kiro and start interacting with OpenSearch using natural language. Build new applications, investigate issues, or run migrations, and see how far intent-driven workflows can go.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Build and extend skills.&lt;/strong&gt; Agent Skills are intentionally modular and extensible. Create your own skills to encode domain-specific workflows, internal best practices, or repeatable operational playbooks. Whether it’s a custom relevance tuning flow or a specialized observability pipeline, your contributions can become reusable intelligence for others.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Contribute to the ecosystem.&lt;/strong&gt; We welcome contributions across all levels, from improving documentation and fixing bugs to adding entirely new skills. If you’ve solved a complex problem with OpenSearch, consider turning it into a skill and contributing it to the &lt;a href="https://github.com/opensearch-project/opensearch-agent-skills" target="_blank" rel="noopener"&gt;Git repo&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Share feedback and ideas.&lt;/strong&gt; Let us know what worked, what didn’t, and what capabilities you’d like to see next, whether it’s deeper integrations, new domains, or more advanced automation.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Join the conversation.&lt;/strong&gt; Engage with the OpenSearch community through GitHub discussions, &lt;a href="https://forum.opensearch.org/" target="_blank" rel="noopener"&gt;community forums&lt;/a&gt;, and working groups. Collaborate with others building similar workflows and help define the future of agent-driven search and observability.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;With OpenSearch Agent Skills, we’re moving toward a world where developers don’t just use tools but draw on shared intelligence. If that resonates with you, we’d love for you to be part of the journey.&lt;/p&gt; 
&lt;p&gt;Star and get involved in the &lt;a href="https://github.com/opensearch-project/opensearch-agent-skills" target="_blank" rel="noopener"&gt;OpenSearch Agent Skills repo.&lt;/a&gt; Join the conversation on the OpenSearch &lt;a href="https://forum.opensearch.org/" target="_blank" rel="noopener"&gt;community forum&lt;/a&gt; and connect with us in the &lt;a href="https://opensearch.org/slack/" target="_blank" rel="noopener"&gt;OpenSearch Slack channel.&lt;/a&gt;&lt;/p&gt; 
&lt;h2&gt;Acknowledgments&lt;/h2&gt; 
&lt;p&gt;We would like to extend our sincere gratitude to the following contributors for their valuable contributions to this project: Arjun kumar Giri, Sarat Vemulapalli, Chenyang Li, Fen Qin, Janelle Arita, Kaituo Li, Krishna Kondaka, Owais Kazi, Peter Zhu, and Zhichao Geng. Your dedication, expertise, and collaborative spirit have been instrumental in making this project successful. Thank you for your time and contributions.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/14/BDB-5934-1.png" alt="Bobby Mohammed" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Bobby Mohammed&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/PLACEHOLDER" target="_blank" rel="noopener"&gt;Bobby&lt;/a&gt; is a Principal Product Manager at AWS, leading product initiatives at the intersection of Search, Generative AI, and Agentic AI. His work focuses on next-generation intelligent applications, from retrieval-augmented generation (RAG) to agent-driven workflows and long-term memory systems. Previously, he helped build foundational AI and data capabilities on Amazon SageMaker, spanning data, analytics, and machine learning at scale. Prior to AWS, he served as Director of Product at Intel, leading deep learning training and inference platforms powering high-performance AI infrastructure. Bobby holds an MBA from the Kellogg School of Management at Northwestern University, and Master’s and Bachelor’s degrees in Electrical Engineering.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/14/BDB-5934-2.png" alt="Sean Zheng" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Sean Zheng&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/PLACEHOLDER" target="_blank" rel="noopener"&gt;Sean&lt;/a&gt; is a Senior Engineering Manager at AWS, where he leads ML/GenAI and search relevancy components within AWS OpenSearch. His team owns plugins including ML Commons, Neural Search, and Search Relevancy Workbench, serving as the primary driver of ML and agentic capabilities for OpenSearch. Recent deliveries under his team include Agentic Search, Agentic Memory, and a Python-based agentic service. Prior to his role with AWS OpenSearch, Sean worked across multiple teams in Amazon’s retail organization, focusing on machine learning and data analytics. His experience spans Core ML, Product Graph, and Search Engine Optimization teams. Sean holds a PhD degree in Computer Science from State University of New York.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>How Smartsheet built Real-time Dynamic Filtering on Apache Flink reducing $40K/month in messaging costs</title>
		<link>https://aws.amazon.com/blogs/big-data/how-smartsheet-built-real-time-dynamic-filtering-on-apache-flink-reducing-40k-month-in-messaging-costs/</link>
		
		<dc:creator><![CDATA[Emre Kartoglu]]></dc:creator>
		<pubDate>Mon, 18 May 2026 18:59:23 +0000</pubDate>
				<category><![CDATA[Amazon Managed Service for Apache Flink]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Customer Solutions]]></category>
		<category><![CDATA[Apache Flink]]></category>
		<category><![CDATA[DynamoDB]]></category>
		<guid isPermaLink="false">7a6963252367f00fec510f993735116a6be1dcb3</guid>

					<description>In this post, you learn how Smartsheet built a Real-time Dynamic Filtering (RDF) system on Amazon Managed Service for Apache Flink, cutting messaging costs by over $40,000 per month and improving live collaboration latency by 1.8x.</description>
										<content:encoded>&lt;p&gt;Processing hundreds of thousands of events per second while maintaining sub-second latency is a challenge many organizations face when building real-time data-driven applications. When filter policy changes propagate in up to 15 minutes, dynamic event routing becomes impractical, forcing teams to over-consume events and discard over 90% after costly per-event lookups. Smartsheet, a work management solution serving millions of users and processing hundreds of thousands of events per second to power features like live collaboration, workflows, and real-time notifications, faced exactly this problem.&lt;/p&gt; 
&lt;p&gt;In this post, you learn how Smartsheet built a Real-time Dynamic Filtering (RDF) system on &lt;a href="https://aws.amazon.com/managed-service-apache-flink/" target="_blank" rel="noopener"&gt;Amazon Managed Service for Apache Flink&lt;/a&gt;, cutting messaging costs by over $40,000 per month and improving live collaboration latency by 1.8x.&lt;/p&gt; 
&lt;h2&gt;The challenge: Static filter policies in a dynamic world&lt;/h2&gt; 
&lt;p&gt;The Smartsheet event-driven architecture publishes hundreds of thousands of events per second to an &lt;a href="https://aws.amazon.com/sns/" target="_blank" rel="noopener"&gt;Amazon Simple Notification Service&lt;/a&gt; (Amazon SNS) topic. Internal teams subscribe to this topic, typically by creating an &lt;a href="https://aws.amazon.com/sqs/" target="_blank" rel="noopener"&gt;Amazon Simple Queue Service&lt;/a&gt; (Amazon SQS) queue with an associated &lt;a href="https://docs.aws.amazon.com/sns/latest/dg/sns-message-filtering.html" target="_blank" rel="noopener"&gt;SNS filter policy&lt;/a&gt; defined through infrastructure as code (IaC). These filter policies are typically static and specify the types of events a consumer wants to receive, such as “sheet row created,” “sheet row updated,” or “sheet row deleted.”&lt;/p&gt; 
&lt;p&gt;Although SNS supports programmatic changes to filter policies, the &lt;a href="https://docs.aws.amazon.com/sns/latest/dg/sns-message-filtering-policy-update.html" target="_blank" rel="noopener"&gt;SNS documentation&lt;/a&gt; notes that changes can take up to 15 minutes to take effect. This eventual consistency window created a significant problem for the Smartsheet live collaboration feature.&lt;/p&gt; 
&lt;p&gt;Live collaboration requires knowing, in real time, which sheets have active collaborators. When a user opens a sheet, the system needs to immediately start receiving events for that sheet. When they close it, the system should stop. With a 15-minute propagation delay on filter policy changes, dynamic per-sheet filtering through SNS was impractical.&lt;/p&gt; 
&lt;p&gt;The workaround was brute force: subscribe to all events (hundreds of thousands per second), pull them into an SQS queue, and use compute to check each event against &lt;a href="https://aws.amazon.com/dynamodb/" target="_blank" rel="noopener"&gt;Amazon DynamoDB&lt;/a&gt; to determine whether the sheet had active collaborators. Over 90% of events were discarded after this lookup.&lt;/p&gt; 
&lt;div style="width: 610px" class="wp-caption alignnone"&gt;
 &lt;img loading="lazy" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/06/BDB-5904-1.png" alt="Architecture diagram showing events flowing from SNS to SQS with per-event DynamoDB lookups before RDF" width="600" height="535"&gt;
 &lt;p class="wp-caption-text"&gt;Figure 1: Before RDF — all events flow through SNS to SQS, with per-event DynamoDB lookups to filter. Over 90% of events are discarded after processing.&lt;/p&gt;
&lt;/div&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;Every event published to the SNS topic is delivered to the SQS queue, regardless of whether any consumer needs it.&lt;/li&gt; 
 &lt;li&gt;The consumer AWS Lambda function reads every message from the SQS queue and must evaluate each one individually.&lt;/li&gt; 
 &lt;li&gt;For each event, the consumer queries DynamoDB to check whether the sheet has active collaborators. This per-event lookup adds latency and DynamoDB read costs on the hot path.&lt;/li&gt; 
 &lt;li&gt;After the DynamoDB lookup, over 90% of events are found to have no active collaborators and are discarded.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;This approach had three compounding cost and performance problems:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;SNS-to-SQS data transfer costs&lt;/strong&gt;: approximately $10,000 per month to deliver all events to the queue&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;SQS costs&lt;/strong&gt;: approximately $30,000 per month to receive, process, and delete the full event volume&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;DynamoDB costs and latency&lt;/strong&gt;: per-event lookups to check collaborator status added load to DynamoDB and increased end-to-end data delivery latency&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;The solution: Real-time Dynamic Filtering with Apache Flink&lt;/h2&gt; 
&lt;p&gt;To solve this, Smartsheet built a system called Real-time Dynamic Filtering (RDF) on Amazon Managed Service for Apache Flink. The core insight was to move the filtering logic into the stream processing layer itself, using Flink’s &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/operators/process_function/#the-coprocessfunction" target="_blank" rel="noopener"&gt;KeyedCoProcessFunction&lt;/a&gt;, a function that processes two connected streams partitioned by a shared key, to maintain dynamic filter policies in Flink state (RocksDB).&lt;/p&gt; 
&lt;h3&gt;How it works&lt;/h3&gt; 
&lt;p&gt;The RDF Flink application reads from two streams:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;&lt;strong&gt;Filter policy stream&lt;/strong&gt;, sourced from &lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html" target="_blank" rel="noopener"&gt;Amazon DynamoDB Streams&lt;/a&gt;. When a team calls the RDF client to change their filter policy (for example, “start receiving events for sheet X”), the change is written to a DynamoDB table and propagated through DynamoDB Streams to the Flink application.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Data stream&lt;/strong&gt;, the stream of sheet events (creates, updates, deletes) that were previously delivered through SNS.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;One challenge remained: some consumers need every event, regardless of sheet. When a consumer subscribes to all events, the system needs every parallel Flink task to know about it. The team solved this using Flink’s broadcast state, which replicates a small set of “subscribe to everything” policies across all tasks. Because only a handful of consumers use this mode, the memory overhead stays negligible.&lt;/p&gt; 
&lt;div style="width: 610px" class="wp-caption alignnone"&gt;
 &lt;img loading="lazy" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/06/BDB-5904-2.png" alt="Architecture diagram showing the RDF system with DynamoDB Streams feeding filter policies to the Flink application" width="600" height="410"&gt;
 &lt;p class="wp-caption-text"&gt;Figure 2: After RDF — consumer teams update filter policies via client libraries. DynamoDB Streams propagates changes to the Flink application, which filters the data stream in real time using keyed state (RocksDB) for specific sheet subscriptions and broadcast state for “all sheets” subscriptions.&lt;/p&gt;
&lt;/div&gt; 
&lt;ol&gt; 
 &lt;li&gt;When a consumer team wants to start or stop receiving events for a specific sheet, it calls the RDF client, a thin wrapper over the DynamoDB SDK. The filter policy change is written to that consumer’s dedicated DynamoDB table. Each consumer has its own table, providing isolated permissions and preventing noisy neighbor issues.&lt;/li&gt; 
 &lt;li&gt;DynamoDB Streams captures every filter policy change as a change data capture (CDC) record and streams it to the Flink application in real time.&lt;/li&gt; 
 &lt;li&gt;Filter policy records are routed in one of two ways: 
  &lt;ol type="a"&gt; 
   &lt;li&gt;Filter policy records for specific sheets are routed to the &lt;code&gt;KeyedCoProcessFunction&lt;/code&gt;, keyed by &lt;code&gt;SheetID&lt;/code&gt;. This makes sure that filter state and event data for the same sheet are co-located in the same Flink parallel task. State is stored in the RocksDB backend, which uses memory when available and spills to disk when necessary, so the system can scale without JVM heap constraints.&lt;/li&gt; 
   &lt;li&gt;Filter policy records where a consumer has called &lt;code&gt;listenToAllEvents()&lt;/code&gt; are broadcast to all parallel Flink tasks via Flink’s broadcast state. Because broadcast state lives in JVM heap, it is used exclusively for these “all sheets” records (of which there are very few), keeping the heap footprint small.&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
 &lt;li&gt;The full stream of CDC events flows into the &lt;code&gt;KeyedCoProcessFunction&lt;/code&gt;, partitioned by SheetID. Each parallel task receives only the events for the sheets it is responsible for and applies the corresponding filter state to decide whether to forward or drop each event.&lt;/li&gt; 
 &lt;li&gt;The broadcast state (containing “all sheets” subscriptions) is made available to all parallel instances of the &lt;code&gt;KeyedCoProcessFunction&lt;/code&gt;, so that consumers subscribed to all events are never filtered out regardless of which task processes their events.&lt;/li&gt; 
 &lt;li&gt;Only events that match an active filter policy are forwarded to the consumer’s SQS queue. The result: sub-second filter policy propagation (p95 ≤1s), elimination of per-event DynamoDB lookups, and over $40,000/month in cost savings.&lt;/li&gt; 
&lt;/ol&gt; 
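&lt;p&gt;The routing logic in the steps above can be sketched in a few lines of plain Python. This is a simplified simulation, not Smartsheet’s implementation and not Flink code: a dictionary stands in for keyed state, a set stands in for broadcast state, and the &lt;code&gt;DynamicFilter&lt;/code&gt; class, its method names, and the consumer names are all hypothetical.&lt;/p&gt; 

```python
# Plain-Python simulation of the filtering logic; it does not use Flink.
# All class, method, field, and consumer names here are hypothetical.
from collections import defaultdict


class DynamicFilter:
    def __init__(self):
        # Stand-in for keyed state: sheet ID -> consumers subscribed to that sheet.
        self.keyed_subscriptions = defaultdict(set)
        # Stand-in for broadcast state: consumers subscribed to all sheets.
        self.all_sheet_consumers = set()

    def on_policy_change(self, consumer, sheet_id, subscribed):
        """Apply one filter policy record; a sheet_id of None means 'all sheets'."""
        if sheet_id is None:
            targets = self.all_sheet_consumers
        else:
            targets = self.keyed_subscriptions[sheet_id]
        if subscribed:
            targets.add(consumer)
        else:
            targets.discard(consumer)

    def on_event(self, event):
        """Return the consumers that should receive this event (empty list means drop)."""
        specific = self.keyed_subscriptions.get(event["sheet_id"], set())
        return sorted(specific | self.all_sheet_consumers)


rdf = DynamicFilter()
rdf.on_policy_change("live-collab", "sheet-1", True)  # a user opens sheet-1
rdf.on_policy_change("audit", None, True)             # subscribes to every sheet

matched = rdf.on_event({"sheet_id": "sheet-1", "type": "row_updated"})
unmatched = rdf.on_event({"sheet_id": "sheet-2", "type": "row_updated"})

rdf.on_policy_change("live-collab", "sheet-1", False)  # the user closes sheet-1
after_close = rdf.on_event({"sheet_id": "sheet-1", "type": "row_updated"})
print(matched, unmatched, after_close)
```

&lt;p&gt;The decision for each event needs only local state, which is why the per-event DynamoDB lookup disappears. In the real system, the keyed part is partitioned across parallel tasks by sheet ID, while the small “all sheets” set is replicated to every task.&lt;/p&gt; 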
&lt;p&gt;Critically, because the filter policy state is persisted in Flink’s &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/docs/ops/state/state_backends/#the-rocksdbstatebackend" target="_blank" rel="noopener"&gt;RocksDB state backend&lt;/a&gt;, the application does not need to perform a DynamoDB lookup for every event. Within 1 second of a filter policy change, the Flink application reads the change from the DynamoDB Streams source, updates its internal state, and begins filtering the data stream accordingly.&lt;/p&gt; 
&lt;h2&gt;Results&lt;/h2&gt; 
&lt;p&gt;The impact of RDF was immediate and measurable across multiple dimensions:&lt;/p&gt; 
&lt;h3&gt;Cost reduction&lt;/h3&gt; 
&lt;table style="width: 100%;border-collapse: collapse;border: 1px solid #d5dbdb" border="1" cellspacing="0" cellpadding="0"&gt; 
 &lt;colgroup&gt; 
  &lt;col style="width: 35%"&gt; 
  &lt;col style="width: 17%"&gt; 
  &lt;col style="width: 27%"&gt; 
  &lt;col style="width: 19%"&gt; 
 &lt;/colgroup&gt; 
 &lt;thead&gt; 
  &lt;tr style="background-color: #f2f3f3"&gt; 
   &lt;th style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;&lt;strong&gt;Cost category&lt;/strong&gt;&lt;/th&gt; 
   &lt;th style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;&lt;strong&gt;Before RDF&lt;/strong&gt;&lt;/th&gt; 
   &lt;th style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;&lt;strong&gt;After RDF&lt;/strong&gt;&lt;/th&gt; 
   &lt;th style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;&lt;strong&gt;Monthly savings&lt;/strong&gt;&lt;/th&gt; 
  &lt;/tr&gt; 
 &lt;/thead&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;SNS → SQS Data Transfer&lt;/td&gt; 
   &lt;td style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;~$10K/month&lt;/td&gt; 
   &lt;td style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;Eliminated&lt;/td&gt; 
   &lt;td style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;~$10K&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;SQS Event Ingestion&lt;/td&gt; 
   &lt;td style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;~$30K/month&lt;/td&gt; 
   &lt;td style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;~$2K/month&lt;/td&gt; 
   &lt;td style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;~$28K&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;DynamoDB Collaborator Lookups&lt;/td&gt; 
   &lt;td style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;Significant load&lt;/td&gt; 
   &lt;td style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;Eliminated (state in Flink)&lt;/td&gt; 
   &lt;td style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;Included in total&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;AWS Lambda&lt;/td&gt; 
   &lt;td style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;~$12K/month&lt;/td&gt; 
   &lt;td style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;~$5K/month&lt;/td&gt; 
   &lt;td style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;~$7K&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr style="background-color: #f2f3f3"&gt; 
   &lt;td style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;&lt;strong&gt;Total&lt;/strong&gt;&lt;/td&gt; 
   &lt;td style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;&lt;/td&gt; 
   &lt;td style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;&lt;/td&gt; 
   &lt;td style="border: 1px solid #d5dbdb;padding: 4px 8px"&gt;~$45K&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
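&lt;p&gt;As a quick check of the arithmetic, the short Python snippet below recomputes the savings from the approximate figures in the table. The DynamoDB row carries no dollar figure, so it is left out of the sum. The variable names are illustrative only.&lt;/p&gt; 

```python
# Approximate monthly figures in thousands of US dollars, copied from the table.
# Each entry is (before RDF, after RDF); "Eliminated" is treated as 0.
# The DynamoDB row has no dollar figure, so it is not part of this sum.
costs = {
    "SNS to SQS data transfer": (10, 0),
    "SQS event ingestion": (30, 2),
    "AWS Lambda": (12, 5),
}

savings = {name: before - after for name, (before, after) in costs.items()}
total_savings = sum(savings.values())

for name, amount in savings.items():
    print(f"{name}: ~${amount}K saved per month")
print(f"Total: ~${total_savings}K saved per month")
```

&lt;p&gt;The three quantified rows account for the full total of about $45K per month, which is consistent with the headline figure of over $40,000 per month.&lt;/p&gt; 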
&lt;h3&gt;Latency improvement&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;1.8x improvement&lt;/strong&gt; in live collaboration data delivery latency. Users see changes from collaborators faster than before.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Filter policy propagation&lt;/strong&gt; reduced from up to 15 minutes to a p95 of under 1 second.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;If your architecture follows a similar fan-out pattern where consumers discard a large percentage of events after per-event lookups, you could achieve comparable cost reductions by moving filtering into the stream processing layer. The savings scale with your event volume and the percentage of events currently discarded.&lt;/p&gt; 
&lt;h2&gt;Key design decisions&lt;/h2&gt; 
&lt;p&gt;Several architectural choices were critical to the success of this solution:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;&lt;strong&gt;Keyed state with selective broadcast&lt;/strong&gt;: Specific sheet subscriptions are stored in keyed state using the RocksDB state backend. The system scales to a large number of filter policies without JVM heap constraints. Flink’s broadcast state is used only for the small number of “all sheets” subscriptions, where every parallel task needs visibility. Because broadcast state is stored in JVM heap, limiting its use to these few records keeps the heap footprint manageable.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;DynamoDB Streams as the filter policy source&lt;/strong&gt;: Rather than building a custom control plane, the team used DynamoDB Streams to propagate filter policy changes. DynamoDB Streams gave the team durability, ordering guarantees, and a native Flink source connector integration.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;RocksDB state backend&lt;/strong&gt;: Persisting filter state in RocksDB eliminated the need for external lookups on the hot path, keeping per-event processing latency low even as the number of active filter policies grows.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Client library abstraction&lt;/strong&gt;: Publishing internal Golang and Java clients lowered the adoption barrier. The client is a thin abstraction on top of the DynamoDB SDK. Each consumer has its own dedicated DynamoDB table and corresponding filter stream, which provides two benefits: it allows fine-grained AWS Identity and Access Management (AWS IAM) permissions per client, and it mitigates the noisy neighbor problem by isolating each consumer’s filter policy traffic. Teams don’t need to understand Flink internals. They interact with a simple API to manage their subscriptions.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h3&gt;Next steps&lt;/h3&gt; 
&lt;p&gt;The live collaboration team was the first adopter of RDF, but the architecture was designed as a shared platform. Smartsheet is now expanding RDF to additional internal teams, including workflow automation and notification routing, where similar fan-out patterns exist. The team is also exploring automatic scaling policies to optimize Flink cluster costs during off-peak hours.&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;Smartsheet’s Real-time Dynamic Filtering system demonstrates how &lt;a href="https://aws.amazon.com/managed-service-apache-flink/" target="_blank" rel="noopener"&gt;Amazon Managed Service for Apache Flink&lt;/a&gt; can solve problems that go beyond stream processing. By combining Flink’s broadcast state pattern with CoProcessFunction, Smartsheet replaced a costly and latency-bound SNS/SQS fan-out architecture with a sub-second dynamic filtering platform. The result: over $40,000 per month in savings, a 1.8x improvement in live collaboration latency, and a reusable platform that multiple teams are now adopting.&lt;/p&gt; 
&lt;p&gt;If you process high-volume event streams and need to dynamically control which events reach specific consumers, this pattern can help you reduce costs and latency, whether for live collaboration, workflow automation, notification routing, or multi-tenant event delivery.&lt;/p&gt; 
&lt;p&gt;To learn more about the services used in this post, visit:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/managed-service-apache-flink/" target="_blank" rel="noopener"&gt;Amazon Managed Service for Apache Flink&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/docs/dev/datastream/fault-tolerance/broadcast_state/" target="_blank" rel="noopener"&gt;Apache Flink Broadcast State Pattern&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html" target="_blank" rel="noopener"&gt;Amazon DynamoDB Streams&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/sns/latest/dg/sns-message-filtering.html" target="_blank" rel="noopener"&gt;Amazon SNS Message Filtering&lt;/a&gt;&lt;/li&gt; 
&lt;/ul&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-91160" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/10/Emre.png" alt="" width="1016" height="1008"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Emre Kartoglu&lt;/h3&gt; 
  &lt;p&gt;Emre is a Principal Engineer at Smartsheet, where he led the adoption of Apache Flink to power features such as sheet history, live collaboration, and automations. Previously a software engineer at AWS helping build Amazon Managed Service for Apache Flink, he is an active open-source contributor to the Apache Flink project, including FLIP-418 for data skew calculation and visualization and the AWS connectors for Apache Flink. He is based in London, UK, and is passionate about operational excellence, building products, establishing mechanisms, and using data to drive positive culture change.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft " src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/07/Rony-profile-picture-mid2.png" alt="Rony Blum" width="100" height="120"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Rony Blum&lt;/h3&gt; 
  &lt;p&gt;Rony Blum is a Senior Solutions Architect at AWS based in Seattle, working with ISV customers to design and implement advanced cloud architectures, specializing in SaaS solutions, multi-tenant systems, and Generative AI applications. Outside of work, Rony enjoys exploring the Pacific Northwest trails on foot and hitting the slopes during ski season.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/fb644351560d8296fe6da332236b1f8d61b2828a/2025/10/06/Francisco-Morillo.jpg" alt="Francisco Morillo" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Francisco Morillo&lt;/h3&gt; 
  &lt;p&gt;Francisco Morillo is a Sr. Streaming Solutions Architect at AWS, specializing in real-time analytics architectures. With over five years in the streaming data space, Francisco has worked as a data analyst for startups and as a big data engineer for consultancies, building streaming data pipelines. He has deep expertise in Amazon Managed Streaming for Apache Kafka (Amazon MSK) and Amazon Managed Service for Apache Flink.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>Optimize Amazon S3 Tables queries with Amazon Redshift</title>
		<link>https://aws.amazon.com/blogs/big-data/optimize-amazon-s3-tables-queries-with-amazon-redshift/</link>
		
		<dc:creator><![CDATA[Tom Romano]]></dc:creator>
		<pubDate>Thu, 14 May 2026 16:58:31 +0000</pubDate>
				<category><![CDATA[Amazon Redshift]]></category>
		<category><![CDATA[Amazon S3 Tables]]></category>
		<category><![CDATA[Analytics]]></category>
		<guid isPermaLink="false">90a84f0b5b502e8e31dbc0dcfe595a3574d309b0</guid>

					<description>This is the third post in our S3 Tables and Amazon Redshift series. The first post covered getting started with querying Apache Iceberg tables, and the second post walked through enterprise-scale governance and access controls. In this post, you address those performance and usability gaps with three different approaches.</description>
										<content:encoded>&lt;p&gt;&lt;a href="https://aws.amazon.com/s3/features/tables/" target="_blank" rel="noopener"&gt;Amazon S3 Tables&lt;/a&gt; with &lt;a href="https://aws.amazon.com/redshift/" target="_blank" rel="noopener"&gt;Amazon Redshift&lt;/a&gt; gives you a powerful combination for analytical workloads on Apache Iceberg tables. But as query volumes grow, small inefficiencies compound. For example, repeated queries, such as dashboards refreshing hourly or analysts running the same joins throughout the day, scan data directly from &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener"&gt;Amazon Simple Storage Service (Amazon S3)&lt;/a&gt; every time. The fully qualified three-part table references (&lt;code&gt;database@catalog.schema.table&lt;/code&gt;) add friction for business intelligence (BI) tools and end users who expect simpler SQL syntax. And without tuning the way S3 Tables organizes your data files, queries read more files than necessary. When you address these three areas, your S3 Tables queries in Amazon Redshift become faster, simpler, and more cost-efficient, whether you’re powering a recurring dashboard or supporting ad hoc analysis at scale.&lt;/p&gt; 
&lt;p&gt;This is the third post in our S3 Tables and Amazon Redshift series. The &lt;a href="https://aws.amazon.com/blogs/big-data/using-amazon-s3-tables-with-amazon-redshift-to-query-apache-iceberg-tables/" target="_blank" rel="noopener"&gt;first post&lt;/a&gt; covered getting started with querying &lt;a href="https://iceberg.apache.org/" target="_blank" rel="noopener"&gt;Apache Iceberg&lt;/a&gt; tables, and the &lt;a href="https://aws.amazon.com/blogs/big-data/scalable-analytics-and-centralized-governance-for-apache-iceberg-tables-using-amazon-s3-tables-and-amazon-redshift/" target="_blank" rel="noopener"&gt;second post&lt;/a&gt; walked through enterprise-scale governance and access controls. In this post, you address those performance and usability gaps with three approaches:&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;Create external schemas to simplify queries from three-part notation down to two-part notation.&lt;/li&gt; 
 &lt;li&gt;Build materialized views that store pre-computed results locally so repeated queries skip the S3 scan.&lt;/li&gt; 
 &lt;li&gt;Configure S3 Tables compaction strategies so the data file layout matches your query patterns.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;The following diagram shows how these three approaches work together. External schemas [1] simplify query syntax through &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/resource-links-about.html" target="_blank" rel="noopener"&gt;AWS Lake Formation resource links&lt;/a&gt; [2], materialized views [3] store pre-computed results locally in Amazon Redshift, and S3 Tables compaction [4] optimizes the underlying file layout for your query patterns.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/07/BDB-5744-1.png" alt="Optimizing S3 Tables queries with external schemas, materialized views, and compaction strategies" width="600"&gt;&lt;/p&gt; 
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;Before you begin, make sure you have:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;An AWS account with permissions to manage &lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener"&gt;AWS Identity and Access Management&lt;/a&gt; (IAM) roles, &lt;a href="https://aws.amazon.com/lake-formation/" target="_blank" rel="noopener"&gt;AWS Lake Formation&lt;/a&gt;, S3 Tables, and Redshift.&lt;/li&gt; 
 &lt;li&gt;An &lt;a href="https://aws.amazon.com/redshift/redshift-serverless/" target="_blank" rel="noopener"&gt;Amazon Redshift Serverless&lt;/a&gt; workgroup or Amazon Redshift provisioned cluster (patch 188 or higher).&lt;/li&gt; 
 &lt;li&gt;An S3 Table bucket with a &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-namespace-create.html" target="_blank" rel="noopener"&gt;namespace&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-create.html" target="_blank" rel="noopener"&gt;tables&lt;/a&gt; created.&lt;/li&gt; 
 &lt;li&gt;Lake Formation configured with the &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/using-service-linked-roles.html" target="_blank" rel="noopener"&gt;AWSServiceRoleForRedshift service-linked role&lt;/a&gt; as a read-only administrator.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;If you haven’t completed these steps, follow the setup instructions in the &lt;a href="https://aws.amazon.com/blogs/big-data/using-amazon-s3-tables-with-amazon-redshift-to-query-apache-iceberg-tables/" target="_blank" rel="noopener"&gt;first post in this series&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;Simplify queries with external schemas&lt;/h2&gt; 
&lt;p&gt;The previous posts in this series used the auto-mounted catalog to query S3 Tables with three-part notation:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;SELECT * FROM redshifticeberg@s3tablescatalog.icebergsons3.examples;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;This syntax works, but it can be cumbersome in business intelligence (BI) tools, in application code, and when you type queries manually. It also requires users to connect through IAM federation. By creating an external schema, you can reference the same tables with a concise two-part notation:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;SELECT * FROM s3tables_schema.examples;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;To set this up, you create a Lake Formation resource link that maps to your S3 Tables database, then create an external schema in Amazon Redshift that points to that resource link. An external schema doesn’t change query performance, but the simpler table reference removes a common barrier to adoption. The setup differs slightly depending on whether your users authenticate through IAM federation or database credentials.&lt;/p&gt; 
&lt;h3&gt;Create a Lake Formation resource link&lt;/h3&gt; 
&lt;p&gt;Both authentication methods require a resource link in Lake Formation that points to your S3 Tables database.&lt;/p&gt; 
&lt;ol type="1"&gt; 
 &lt;li&gt;In the Lake Formation console, choose &lt;strong&gt;Databases&lt;/strong&gt; under &lt;strong&gt;Data Catalog&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;On the &lt;strong&gt;Create&lt;/strong&gt; menu, choose &lt;strong&gt;Resource link&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Configure the resource link with the following settings: 
  &lt;ul&gt; 
   &lt;li&gt;&lt;strong&gt;Resource link name:&lt;/strong&gt; &lt;code&gt;s3tables_rl&lt;/code&gt;&lt;/li&gt; 
   &lt;li&gt;&lt;strong&gt;Destination Catalog:&lt;/strong&gt; Your account ID (for example, &lt;code&gt;111122223333&lt;/code&gt;)&lt;/li&gt; 
   &lt;li&gt;&lt;strong&gt;Shared Database:&lt;/strong&gt; Your S3 Tables database (for example, &lt;code&gt;icebergsons3&lt;/code&gt;)&lt;/li&gt; 
   &lt;li&gt;&lt;strong&gt;Shared Database’s Catalog ID:&lt;/strong&gt; Your S3 Table bucket in the format &lt;code&gt;111122223333:s3tablescatalog/redshifticeberg&lt;/code&gt;&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/07/BDB-5744-2.png" alt="Resource link creation in Lake Formation with catalog ID and shared database configured" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;For more information, see &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/creating-resource-links.html" target="_blank" rel="noopener"&gt;Creating resource links&lt;/a&gt; in the Lake Formation documentation.&lt;/p&gt; 
&lt;h3&gt;Option A: External schema for IAM federated users&lt;/h3&gt; 
&lt;p&gt;If your users connect to Amazon Redshift through IAM federation, &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/r_CREATE_EXTERNAL_SCHEMA.html" target="_blank" rel="noopener"&gt;create the external schema&lt;/a&gt; with the &lt;code&gt;SESSION&lt;/code&gt; keyword. This passes the federated user’s credentials through to Lake Formation for access control:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;CREATE EXTERNAL SCHEMA s3tables_schema
FROM DATA CATALOG
DATABASE 's3tables_rl'
CATALOG_ID '111122223333'
IAM_ROLE 'SESSION'
CATALOG_ROLE 'SESSION';&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Lake Formation evaluates permissions based on the federated user’s IAM role, so each user sees only the tables and columns that their role allows. This is the recommended approach for new deployments because it provides fine-grained access control without additional role management.&lt;/p&gt; 
&lt;h3&gt;Option B: External schema for database users&lt;/h3&gt; 
&lt;p&gt;External applications like Tableau, Power BI, and custom ETL tools often authenticate with database credentials instead of IAM federation. For these users, Amazon Redshift needs an IAM role that it can assume to access S3 Tables on their behalf.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Create an IAM service role to access S3 Tables:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;You create a role (for example, &lt;code&gt;S3TableAccessRole&lt;/code&gt;) with a trust policy that allows Amazon Redshift to assume it:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "redshift.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;You then attach the following permission policies to the role:&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;A policy for Lake Formation data access (substitute your 12-digit AWS Account ID for&lt;/em&gt; &lt;code&gt;YOUR_ACCOUNT_ID&lt;/code&gt;&lt;em&gt;):&lt;/em&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "lakeformation:GetDataAccess",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "YOUR_ACCOUNT_ID"
                }
            }
        },
        {
            "Effect": "Deny",
            "Action": "lakeformation:PutDataLakeSettings",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "YOUR_ACCOUNT_ID"
                }
            }
        }
    ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;em&gt;A policy for &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/catalog-and-crawler.html" target="_blank" rel="noopener"&gt;AWS Glue Data Catalog&lt;/a&gt; access (substitute the appropriate AWS Region for&lt;/em&gt; &lt;code&gt;REGION_ID&lt;/code&gt; &lt;em&gt;and your 12-digit AWS Account ID for&lt;/em&gt; &lt;code&gt;YOUR_ACCOUNT_ID&lt;/code&gt;&lt;em&gt;). This example uses wildcards for databases and tables; for production, scope these permissions to your specific resources and AWS Region:&lt;/em&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetTableVersion",
                "glue:GetTableVersions",
                "glue:GetTags"
            ],
            "Resource": [
                "arn:aws:glue:REGION_ID:YOUR_ACCOUNT_ID:catalog",
                "arn:aws:glue:REGION_ID:YOUR_ACCOUNT_ID:database/*",
                "arn:aws:glue:REGION_ID:YOUR_ACCOUNT_ID:table/*/*"
            ]
        }
    ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Grant Lake Formation permissions to the role:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;In the Lake Formation console, grant the &lt;code&gt;S3TableAccessRole&lt;/code&gt; DESCRIBE access on the database and SELECT access on the tables for your resource link. For detailed steps, see &lt;a href="https://docs.aws.amazon.com/lake-formation/latest/dg/granting-lake-formation-permissions.html" target="_blank" rel="noopener"&gt;Granting Lake Formation permissions&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/07/BDB-5744-3.png" alt="Lake Formation DESCRIBE permission on resource link database" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/07/BDB-5744-4.png" alt="Lake Formation SELECT permission on tables" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Associate the role and create the schema:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;First, associate the IAM role with your Amazon Redshift cluster or workgroup. For instructions, see &lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/authorizing-redshift-service.html" target="_blank" rel="noopener"&gt;Associating IAM roles with Amazon Redshift&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;Create the external schema:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;CREATE EXTERNAL SCHEMA s3tables_schema
FROM DATA CATALOG
DATABASE 's3tables_rl'
IAM_ROLE 'arn:aws:iam::111122223333:role/S3TableAccessRole';&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Then grant access to your database users:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;GRANT USAGE ON SCHEMA s3tables_schema TO my_database_user;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
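&lt;p&gt;If several database users or BI service accounts need the same access, granting schema usage to a role can be easier to maintain than per-user grants. The following is a minimal sketch that assumes role-based access control is available in your Amazon Redshift environment; the role name &lt;code&gt;bi_readers&lt;/code&gt; is hypothetical.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;-- Hypothetical role name; adjust to your naming convention
CREATE ROLE bi_readers;

-- Allow members of the role to reference objects in the external schema
GRANT USAGE ON SCHEMA s3tables_schema TO ROLE bi_readers;

-- Add an existing database user to the role
GRANT ROLE bi_readers TO my_database_user;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;These grants only control who can reference the schema in Amazon Redshift. Access to the underlying tables is still governed by the Lake Formation permissions granted to &lt;code&gt;S3TableAccessRole&lt;/code&gt;.&lt;/p&gt; 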
&lt;h3&gt;Query with two-part notation&lt;/h3&gt; 
&lt;p&gt;With either option, you can now query S3 Tables using the simpler two-part notation:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;SELECT * FROM s3tables_schema.examples LIMIT 10;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/07/BDB-5744-5.png" alt="Query results showing two-part notation returning rows from the examples table" width="600"&gt;&lt;/p&gt; 
&lt;p&gt;You can use this notation in BI tools, JDBC/ODBC connections, and application code. Your users no longer need to know the underlying catalog structure.&lt;/p&gt; 
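&lt;p&gt;If a query against the new schema fails, it can help to confirm how Amazon Redshift registered the schema. The following sketch assumes the &lt;code&gt;SVV_EXTERNAL_SCHEMAS&lt;/code&gt; and &lt;code&gt;SVV_EXTERNAL_TABLES&lt;/code&gt; system views are visible to your user and that the schema is named &lt;code&gt;s3tables_schema&lt;/code&gt;, as in the earlier examples.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;-- Confirm that the external schema exists and which catalog database it maps to
SELECT schemaname, databasename
FROM svv_external_schemas
WHERE schemaname = 's3tables_schema';

-- List the tables that are visible through the schema
SELECT schemaname, tablename
FROM svv_external_tables
WHERE schemaname = 's3tables_schema';&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;If the first query returns the schema but the second returns no rows, a likely place to look is the Lake Formation permissions on the resource link and its target tables.&lt;/p&gt; 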
&lt;h2&gt;Accelerate queries with materialized views&lt;/h2&gt; 
&lt;p&gt;When you repeatedly query S3 Tables, each execution scans the external data from S3. Materialized views store pre-computed results in Amazon Redshift, so subsequent queries read from local storage instead of scanning S3 on every run.&lt;/p&gt; 
&lt;p&gt;Amazon Redshift supports incremental refresh for materialized views on Apache Iceberg tables, including changes made by INSERT, DELETE, UPDATE, and table compaction operations. After the initial creation, each subsequent refresh processes only the rows that changed since the last refresh, rather than recomputing the full result set. This helps reduce both the time and compute cost of keeping your views current, especially for large tables with frequent changes.&lt;/p&gt; 
&lt;p&gt;Materialized views have general limitations and considerations when used with external data lake tables. For details, see &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-external-table.html" target="_blank" rel="noopener"&gt;Materialized views on external data lake tables&lt;/a&gt;.&lt;/p&gt; 
&lt;h3&gt;Create a materialized view on S3 Tables&lt;/h3&gt; 
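&lt;p&gt;The following example &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-create-sql-command.html" target="_blank" rel="noopener"&gt;creates a materialized view&lt;/a&gt; that joins the &lt;code&gt;examples&lt;/code&gt; table in S3 Tables with a local &lt;code&gt;categories&lt;/code&gt; table in Amazon Redshift. The view pre-computes daily record counts and distinct ID counts per category. Because the Amazon Redshift documentation lists DISTINCT aggregates among the constructs that aren’t eligible for incremental refresh, consider removing the &lt;code&gt;unique_ids&lt;/code&gt; column if you need the view to refresh incrementally:&lt;/p&gt; 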
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;CREATE MATERIALIZED VIEW mv_daily_category_summary
DISTSTYLE KEY
DISTKEY (category_id)
SORTKEY (insert_date)
AS
SELECT
    c.category_id,
    c.department,
    e.insert_date,
    COUNT(*) AS record_count,
    COUNT(DISTINCT e.id) AS unique_ids
FROM s3tables_schema.examples e
JOIN public.categories c
  ON c.category_id = e.category_id
GROUP BY c.category_id, c.department, e.insert_date;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Query the materialized view directly:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;SELECT category_id, department, insert_date, record_count
FROM mv_daily_category_summary
ORDER BY record_count DESC
LIMIT 10;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Your query can now read from local Amazon Redshift storage and typically returns results without scanning S3 Tables:&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/07/BDB-5744-6.png" alt="Query results from the materialized view showing category data with record counts" width="600"&gt;&lt;/p&gt; 
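&lt;p&gt;One way to check where a query reads its data is to compare query plans. The following sketch runs &lt;code&gt;EXPLAIN&lt;/code&gt; on the materialized view query and on a comparable aggregate computed directly against the external table. The second query is illustrative only and isn’t part of the original walkthrough.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;-- Plan for the query against the materialized view
EXPLAIN
SELECT category_id, department, insert_date, record_count
FROM mv_daily_category_summary
ORDER BY record_count DESC
LIMIT 10;

-- Plan for a comparable aggregate on the external table, for comparison
EXPLAIN
SELECT category_id, insert_date, COUNT(*) AS record_count
FROM s3tables_schema.examples
GROUP BY category_id, insert_date
ORDER BY record_count DESC
LIMIT 10;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Plans that read external data generally include Amazon S3 scan steps, while the plan for the materialized view should reference only local storage. The exact plan text can vary by Amazon Redshift version.&lt;/p&gt; 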
&lt;h3&gt;Refresh strategies&lt;/h3&gt; 
&lt;p&gt;You have two options for keeping materialized views current:&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;Automatic refresh:&lt;/em&gt; Set &lt;code&gt;AUTO REFRESH YES&lt;/code&gt; in the view definition to have Amazon Redshift automatically refresh the view in the background when it detects changes to the base tables. This is a good fit for dashboards and reports that can tolerate a short delay between data changes and query results. Note that automatic refresh requires Option B (database user) when creating the external schema, and the default is &lt;code&gt;AUTO REFRESH NO&lt;/code&gt;.&lt;/p&gt; 
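&lt;p&gt;As a sketch of the automatic option, you can turn on automatic refresh for an existing view or set it when you create a view. Both statements assume the external schema was created with Option B; the view name &lt;code&gt;mv_daily_record_counts&lt;/code&gt; is hypothetical.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;-- Turn on automatic refresh for the existing view
ALTER MATERIALIZED VIEW mv_daily_category_summary AUTO REFRESH YES;

-- Or set it at creation time (hypothetical view name)
CREATE MATERIALIZED VIEW mv_daily_record_counts
AUTO REFRESH YES
AS
SELECT insert_date, COUNT(*) AS record_count
FROM s3tables_schema.examples
GROUP BY insert_date;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 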
&lt;p&gt;&lt;em&gt;Manual refresh:&lt;/em&gt; Run &lt;code&gt;REFRESH MATERIALIZED VIEW&lt;/code&gt; when you need to control the timing:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;REFRESH MATERIALIZED VIEW mv_daily_category_summary;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Use manual refresh when you need to coordinate updates with data loading pipelines or when you want to refresh during off-peak hours.&lt;/p&gt; 
&lt;h2&gt;Tune S3 Tables compaction for your query patterns&lt;/h2&gt; 
&lt;p&gt;S3 Tables automatically compacts small Parquet files into larger ones in the background. This compaction reduces the number of read requests your query engine must make, which can improve query performance. By default, compaction targets a file size of 512 MB, configurable between 64 MB and 512 MB. Four compaction strategies are available, and choosing the right one for your query patterns can make a measurable difference.&lt;/p&gt; 
&lt;h3&gt;Compaction strategies&lt;/h3&gt; 
&lt;table border="1px" width="100%" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Strategy&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;When to use&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;How it works&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Auto&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;You want S3 to decide for you&lt;/td&gt; 
   &lt;td&gt;Selects sort compaction for sorted tables, binpack for unsorted tables&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Binpack&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;General-purpose workloads, unsorted tables&lt;/td&gt; 
   &lt;td&gt;Combines small files into larger files (100 MB+) and applies pending row-level deletes&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Sort&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;Queries frequently filter on a single column (e.g., &lt;code&gt;insert_date&lt;/code&gt;)&lt;/td&gt; 
   &lt;td&gt;Organizes data by the table’s sort-order columns during compaction&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Z-order&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;Queries filter on two or more columns together (e.g., &lt;code&gt;insert_date&lt;/code&gt; and &lt;code&gt;category_id&lt;/code&gt;)&lt;/td&gt; 
   &lt;td&gt;Blends multiple column values into a single scalar for sorting&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;Binpack improves performance by reducing the number of files a query engine reads. Sort compaction goes further. By ordering data within files, it enables query engines to skip entire files based on column min/max metadata during predicate pushdown. This is effective for queries that filter on the sort column, such as date-range filters. Z-order extends this benefit to queries that filter on multiple columns simultaneously, at the cost of slightly less efficient pruning on any single column compared to a pure sort.&lt;/p&gt; 
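&lt;p&gt;To make the distinction concrete, the following sketch shows the two query shapes described above, using the &lt;code&gt;insert_date&lt;/code&gt; and &lt;code&gt;category_id&lt;/code&gt; columns from the earlier examples. The date literals and the category value are hypothetical, and the sketch assumes &lt;code&gt;category_id&lt;/code&gt; is an integer column.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;-- Single-column range filter: a candidate for sort compaction on insert_date
SELECT category_id, COUNT(*) AS record_count
FROM s3tables_schema.examples
WHERE insert_date BETWEEN '2026-01-01' AND '2026-01-31'
GROUP BY category_id;

-- Filters on two columns together: a candidate for z-order on insert_date and category_id
SELECT COUNT(*) AS record_count
FROM s3tables_schema.examples
WHERE insert_date &gt;= '2026-01-01'
  AND category_id = 42;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;If most of your workload resembles the first query, sort compaction is the closer match. If it resembles the second, z-order is worth evaluating.&lt;/p&gt; 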
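&lt;p&gt;To use sort or z-order compaction, first define a sort order on the table: one column for sort, or multiple columns for z-order. The following statements use the Apache Iceberg SQL extensions, so run them from an engine that supports them, such as Apache Spark:&lt;/p&gt; 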
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;-- Sort
ALTER TABLE icebergsons3.examples WRITE ORDERED BY insert_date;

-- Z-Order
ALTER TABLE icebergsons3.examples WRITE ORDERED BY insert_date, category_id;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3&gt;Configure a compaction strategy&lt;/h3&gt; 
&lt;p&gt;To change the compaction strategy for a table, use the &lt;code&gt;PutTableMaintenanceConfiguration&lt;/code&gt; API through the &lt;a href="https://aws.amazon.com/cli/" target="_blank" rel="noopener"&gt;AWS Command Line Interface (AWS CLI)&lt;/a&gt;:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;aws s3tables put-table-maintenance-configuration \
    --table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/redshifticeberg \
    --type icebergCompaction \
    --namespace icebergsons3 \
    --name examples \
    --value '{"status":"enabled","settings":{"icebergCompaction":{"strategy":"sort"}}}'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;To adjust the target file size (for example, to 256 MB):&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;aws s3tables put-table-maintenance-configuration \
    --table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/redshifticeberg \
    --type icebergCompaction \
    --namespace icebergsons3 \
    --name examples \
    --value '{"status":"enabled","settings":{"icebergCompaction":{"targetFileSizeMB":256}}}'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;To use z-order compaction, specify &lt;code&gt;{"strategy":"z-order"}&lt;/code&gt; in the &lt;code&gt;icebergCompaction&lt;/code&gt; settings, in the same way as the sort example.&lt;/p&gt; 
&lt;p&gt;For more detail on sort and z-order, see &lt;a href="https://aws.amazon.com/blogs/aws/new-improve-apache-iceberg-query-performance-in-amazon-s3-with-sort-and-z-order-compaction/" target="_blank" rel="noopener"&gt;Improve Apache Iceberg query performance in Amazon S3 with sort and z-order compaction&lt;/a&gt;.&lt;/p&gt; 
&lt;h3&gt;Snapshot management&lt;/h3&gt; 
&lt;p&gt;S3 Tables manages snapshots automatically. By default, it keeps a minimum of 1 snapshot and expires snapshots older than 120 hours (5 days). You can customize snapshot retention by setting &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-maintenance.html#s3-tables-maintenance-snapshot" target="_blank" rel="noopener"&gt;minSnapshotsToKeep and maxSnapshotAgeHours&lt;/a&gt;. After a snapshot reaches the expiration time you configured in your retention settings, S3 Tables marks objects that only that snapshot references as noncurrent and removes them based on the unreferenced file removal policy.&lt;/p&gt; 
&lt;p&gt;You can adjust these settings if your workload needs more snapshots for time-travel queries or longer retention:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;aws s3tables put-table-maintenance-configuration \
    --table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/redshifticeberg \
    --namespace icebergsons3 \
    --name examples \
    --type icebergSnapshotManagement \
    --value '{"status":"enabled","settings":{"icebergSnapshotManagement":{"minSnapshotsToKeep":10,"maxSnapshotAgeHours":2500}}}'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Keep in mind that retaining more snapshots increases storage costs. If a materialized view references an expired snapshot, Amazon Redshift falls back to a full recompute on the next refresh. Therefore, snapshot retention can directly affect your materialized view refresh behavior. Balance snapshot retention with your materialized view refresh frequency to avoid unnecessary full recomputes.&lt;/p&gt; 
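&lt;p&gt;As a rough illustration of that balance, the following Python sketch builds the &lt;code&gt;--value&lt;/code&gt; payload for &lt;code&gt;icebergSnapshotManagement&lt;/code&gt; and rejects a retention window that isn't longer than the materialized view refresh interval. The function name, the &lt;code&gt;refresh_interval_hours&lt;/code&gt; parameter, and the &lt;code&gt;margin&lt;/code&gt; safety factor are hypothetical and not part of any AWS API. Only the JSON shape is taken from the CLI example in this post.&lt;/p&gt; 

```python
import json


def snapshot_management_value(min_snapshots, max_age_hours,
                              refresh_interval_hours, margin=2.0):
    """Build the JSON string for --value of icebergSnapshotManagement.

    margin is a hypothetical safety factor: retention must be longer than
    margin times the materialized view refresh interval.
    """
    if not max_age_hours > refresh_interval_hours * margin:
        raise ValueError(
            "maxSnapshotAgeHours should comfortably exceed the refresh interval"
        )
    return json.dumps(
        {
            "status": "enabled",
            "settings": {
                "icebergSnapshotManagement": {
                    "minSnapshotsToKeep": min_snapshots,
                    "maxSnapshotAgeHours": max_age_hours,
                }
            },
        },
        separators=(",", ":"),
    )


if __name__ == "__main__":
    # Example: views refresh every 24 hours, snapshots are kept for 240 hours.
    print(snapshot_management_value(10, 240, 24))
```

&lt;p&gt;You could pass the printed string to the &lt;code&gt;--value&lt;/code&gt; option of the command shown earlier. The right margin depends on how often refreshes are delayed or skipped in your environment.&lt;/p&gt; 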
&lt;p&gt;For more information, see &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-maintenance.html" target="_blank" rel="noopener"&gt;Maintenance for tables&lt;/a&gt; in the Amazon S3 documentation.&lt;/p&gt; 
&lt;h2&gt;Best practices&lt;/h2&gt; 
&lt;p&gt;&lt;strong&gt;Choose the right access pattern for your users.&lt;/strong&gt; Use IAM federation with &lt;code&gt;SESSION&lt;/code&gt; credentials for new applications and interactive users. Reserve the IAM role approach for BI tools and extract, transform, and load (ETL) pipelines that can’t integrate with IAM federation directly. Plan to migrate database users to federated access over time.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Match compaction strategy to query patterns.&lt;/strong&gt; Use sort compaction when your queries filter on a single column (such as date ranges). Use z-order when queries filter on two or more columns together. Stick with the auto default if your query patterns vary or you’re unsure.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Size materialized views for your refresh window.&lt;/strong&gt; Materialized views that join large external tables with local tables take longer to refresh. If your data changes frequently, keep the materialized view focused on the specific aggregations your dashboards need rather than materializing entire tables.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Coordinate snapshot retention with materialized view refresh.&lt;/strong&gt; If a materialized view references an expired Iceberg snapshot, Amazon Redshift performs a full recompute instead of an incremental refresh. Set your snapshot retention (&lt;code&gt;maxSnapshotAgeHours&lt;/code&gt;) longer than your materialized view refresh interval.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Monitor compaction with &lt;a href="https://aws.amazon.com/cloudtrail/" target="_blank" rel="noopener"&gt;AWS CloudTrail&lt;/a&gt;.&lt;/strong&gt; S3 Tables logs compaction operations as CloudTrail management events. Track these to verify that compaction runs on schedule and to identify tables that might benefit from a different strategy.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Balance performance gains against storage costs.&lt;/strong&gt; Materialized views store pre-computed results in Amazon Redshift, adding to your managed storage. Compaction reduces file counts, but z-order and sort compaction can increase overall storage because of data duplication across sort boundaries. Review your Amazon Redshift managed storage usage and S3 Tables storage metrics periodically to make sure the performance benefits justify the additional storage utilization.&lt;/p&gt; 
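&lt;p&gt;The compaction guidance above can be sketched as a small decision helper. This is a minimal illustration, not an AWS API: &lt;code&gt;pick_compaction_strategy&lt;/code&gt; and &lt;code&gt;compaction_value&lt;/code&gt; are hypothetical names, and placing the &lt;code&gt;strategy&lt;/code&gt; key next to &lt;code&gt;targetFileSizeMB&lt;/code&gt; is an assumption based on the examples in this post.&lt;/p&gt; 

```python
import json

# Strategy names mentioned in this post.
VALID_STRATEGIES = ("auto", "binpack", "sort", "z-order")


def pick_compaction_strategy(filter_columns):
    """Map the commonly filtered columns to a compaction strategy.

    Pass None or an empty list when query patterns are unknown or mixed.
    """
    if not filter_columns:
        return "auto"
    if len(filter_columns) == 1:
        return "sort"
    return "z-order"


def compaction_value(strategy, target_file_size_mb=256):
    """Build the JSON string for --value of an icebergCompaction configuration."""
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown strategy: {strategy}")
    return json.dumps(
        {
            "status": "enabled",
            "settings": {
                "icebergCompaction": {
                    "targetFileSizeMB": target_file_size_mb,
                    "strategy": strategy,
                }
            },
        },
        separators=(",", ":"),
    )


if __name__ == "__main__":
    # Hypothetical column names for a table filtered by date and category.
    strategy = pick_compaction_strategy(["order_date", "category"])
    print(compaction_value(strategy))
```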
&lt;h2&gt;Troubleshooting&lt;/h2&gt; 
&lt;table border="1px" width="100%" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Issue&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Resolution&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;“Permission denied” when creating the external schema&lt;/td&gt; 
   &lt;td&gt;Verify the IAM role has &lt;code&gt;lakeformation:GetDataAccess&lt;/code&gt; permission. Confirm you associated the role with your Amazon Redshift cluster or workgroup. Also check that you granted the role access to the resource link database and its tables in Lake Formation.&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;“Schema not found” or “Database not found” errors&lt;/td&gt; 
   &lt;td&gt;Confirm the resource link name in Lake Formation matches the &lt;code&gt;DATABASE&lt;/code&gt; value in your &lt;code&gt;CREATE EXTERNAL SCHEMA&lt;/code&gt; statement. Verify the catalog ID format uses the pattern &lt;code&gt;account_id:s3tablescatalog/bucket_name&lt;/code&gt;.&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;“Table not found” when querying through the external schema&lt;/td&gt; 
   &lt;td&gt;Check that Lake Formation permissions include table-level access, not just database-level. Verify the table exists in the S3 Tables catalog by querying it through the auto-mounted catalog first.&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Materialized view refresh falls back to full recompute&lt;/td&gt; 
   &lt;td&gt;Check if the referenced Iceberg snapshot has expired. Increase &lt;code&gt;maxSnapshotAgeHours&lt;/code&gt; in the snapshot management configuration. Verify that the base table hasn’t exceeded 4 million position deletes in a single data file. Compaction resolves this.&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Queries on S3 Tables are slow after data loading&lt;/td&gt; 
   &lt;td&gt;Compaction runs on an automated schedule and may not have processed recent writes yet. Check CloudTrail for the latest compaction event. Verify the compaction strategy matches your query patterns. Switch from binpack to sort if you filter on specific columns.&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
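&lt;p&gt;To catch a malformed catalog ID before it reaches &lt;code&gt;CREATE EXTERNAL SCHEMA&lt;/code&gt;, a pre-flight check along the following lines might help. The regular expression is an assumption: it expects a 12-digit account ID, the literal &lt;code&gt;s3tablescatalog&lt;/code&gt;, and a bucket name of 3 to 63 lowercase letters, digits, or hyphens. Adjust it if your naming differs.&lt;/p&gt; 

```python
import re

# Assumed shape: account_id:s3tablescatalog/bucket_name
CATALOG_ID_PATTERN = re.compile(
    r"\d{12}:s3tablescatalog/[a-z0-9][a-z0-9-]{1,61}[a-z0-9]"
)


def is_valid_catalog_id(catalog_id):
    """Return True when catalog_id matches the assumed pattern."""
    return CATALOG_ID_PATTERN.fullmatch(catalog_id) is not None


if __name__ == "__main__":
    for candidate in (
        "111122223333:s3tablescatalog/redshifticeberg",
        "111122223333:redshifticeberg",
    ):
        print(candidate, is_valid_catalog_id(candidate))
```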
&lt;h2&gt;Cleaning up&lt;/h2&gt; 
&lt;p&gt;To avoid ongoing costs, remove the resources you created in this walkthrough:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;-- Drop materialized views
DROP MATERIALIZED VIEW IF EXISTS mv_daily_category_summary;

-- Drop external schemas
DROP SCHEMA IF EXISTS s3tables_schema;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Also remove:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;The IAM role (&lt;code&gt;S3TableAccessRole&lt;/code&gt;) and its attached policies, if you created one for database users.&lt;/li&gt; 
 &lt;li&gt;The Lake Formation resource link and associated permissions.&lt;/li&gt; 
 &lt;li&gt;The &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-buckets-delete.html" target="_blank" rel="noopener"&gt;S3 table bucket&lt;/a&gt;, if you no longer need the data.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, we showed how to optimize S3 Tables queries from Amazon Redshift using three approaches. External schemas simplify query syntax from three-part to two-part notation, making it easier for BI tools and end users to work with S3 Tables. Materialized views provide pre-computed analytical results that reduce repeated S3 scans. S3 Tables compaction strategies, tuned to your query patterns, make file access more efficient.&lt;/p&gt; 
&lt;p&gt;For new applications, design your access layer with IAM federation and external schemas from the start. Use materialized views to accelerate repeated analytical queries that join S3 Tables with local Amazon Redshift data. Match your compaction strategy to how your team queries the data. Use sort compaction for date-range filters and z-order when queries filter on multiple columns at once. Furthermore, the same S3 tables you optimize here are also accessible from Amazon Athena, Amazon EMR, and third-party engines.&lt;/p&gt; 
&lt;p&gt;To learn more, see the &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables.html" target="_blank" rel="noopener"&gt;Amazon S3 Tables documentation&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/materialized-view-overview.html" target="_blank" rel="noopener"&gt;Materialized views in Amazon Redshift&lt;/a&gt;, and &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-maintenance.html" target="_blank" rel="noopener"&gt;S3 Tables maintenance&lt;/a&gt;. We welcome your feedback in the comments.&lt;/p&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/07/BDB-5744-7.png" alt="Tom Romano" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Tom Romano&lt;/h3&gt; 
  &lt;p&gt;Tom Romano is a Senior Solutions Architect for AWS World Wide Public Sector based in Tampa, FL. He works with GovTech customers to build solutions using serverless architectures, generative AI, and modern data and DevOps practices. In his free time, Tom flies remote control model airplanes and enjoys vacationing with his family around Florida and the Caribbean.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt; 
   &lt;p&gt;&lt;img loading="lazy" class="alignleft size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/07/BDB-5744-8.png" alt="Satesh Sonti" width="100" height="100"&gt;&lt;/p&gt; 
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Satesh Sonti&lt;/h3&gt; 
  &lt;p&gt;Satesh Sonti is a Principal Analytics Specialist Solutions Architect based out of Atlanta, specializing in building enterprise data platforms, data warehousing, and analytics solutions. He has over 20 years of experience in building data assets and leading complex data platform programs for banking and insurance clients across the globe.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>Securing client confidentiality at scale: Automated data discovery and governed analytics for legal workloads</title>
		<link>https://aws.amazon.com/blogs/big-data/securing-client-confidentiality-at-scale-automated-data-discovery-and-governed-analytics-for-legal-workloads/</link>
					
		
		<dc:creator><![CDATA[Rohan Kamat]]></dc:creator>
		<pubDate>Wed, 13 May 2026 15:57:14 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon Macie]]></category>
		<category><![CDATA[Amazon Quick Suite]]></category>
		<category><![CDATA[Amazon Simple Notification Service (SNS)]]></category>
		<category><![CDATA[Amazon Simple Storage Service (S3)]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Artificial Intelligence]]></category>
		<category><![CDATA[AWS Glue]]></category>
		<category><![CDATA[AWS Lake Formation]]></category>
		<category><![CDATA[AWS Security Hub]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">99e1696fc52f579762912b853853132c5d6dde6d</guid>

					<description>In this post, we show you a reference architecture that automates sensitive data discovery across legal document repositories on Amazon Web Services (AWS), demonstrate how to capture structured findings as a compliance dataset, and guide you through building a governed analytics workspace that maintains your security boundaries. You walk away with a practical model for building security and analytics into the same lifecycle, without moving documents outside their system of record.</description>
										<content:encoded>&lt;p&gt;Automating data security and analytics for legal documents presents a unique challenge. Your legal team stores documents with strong access controls, organized by client and matter, encrypted at rest, and governed by well-defined policies. But what happens when you want to run analytics across those repositories? The typical path is extracting content into separate data pipelines or third-party tools, which fragments your governance model and introduces new risks. Law firms and corporate legal departments operate under distinct obligations that make data governance non-negotiable. Attorney-client privilege, work product doctrine, and professional conduct rules impose strict duties around how client information is handled, accessed, and disclosed. Governance failure in this context isn’t just a compliance gap; it can result in privilege waiver, disqualification from representation, or disciplinary action.&lt;/p&gt; 
&lt;p&gt;Legal professionals use &lt;em&gt;ethical walls&lt;/em&gt;, also called &lt;em&gt;information barriers&lt;/em&gt;, as structural safeguards that prevent the flow of confidential information between teams within a firm that represent adverse or potentially conflicting interests. Professional conduct rules mandate these barriers, and failure to maintain them can result in firm disqualification, malpractice liability, or regulatory sanctions.&lt;/p&gt; 
&lt;p&gt;Privilege boundaries are equally critical. Attorney-client privilege and work product protection apply only when you properly control access to the underlying material. If you expose privileged documents or metadata about their contents to unauthorized individuals, you risk losing your privilege protection. When organizations fail to maintain reasonable controls over privileged material, courts might find that they have waived their privilege. You should therefore actively manage your access governance, not only as a security concern but as a legal preservation requirement.&lt;/p&gt; 
&lt;p&gt;When you extract content into separate analytics systems or grant broader access than your matter structures support, you create pressure on both protections. You gain visibility but lose confidence in your controls.&lt;/p&gt; 
&lt;p&gt;In this post, we show you a reference architecture that automates sensitive data discovery across legal document repositories on Amazon Web Services (AWS), demonstrate how to capture structured findings as a compliance dataset, and guide you through building a governed analytics workspace that maintains your security boundaries. You walk away with a practical model for building security and analytics into the same lifecycle, without moving documents outside their system of record.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Analytics shouldn’t weaken governance&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Most legal organizations have invested heavily in securing their document repositories. You store documents in structured storage, organized by client and matter. Your access controls map to matter boundaries (the organizational and access structures that separate one client engagement from another). You establish retention and hold policies.&lt;/p&gt; 
&lt;p&gt;The difficulty starts when teams want to analyze what’s inside those repositories. Running analytics typically means copying content into a separate system, standing up a new data pipeline, or granting broader access than existing matter structures support. Each of these steps introduces governance gaps. Manual reporting fills some of the void, but it doesn’t scale and can’t provide continuous visibility. What’s missing is a model where security controls and analytics reinforce each other, where the act of discovering sensitive data also produces the dataset that you use for reporting, and where governance applies once and carries through every downstream operation.&lt;/p&gt; 
&lt;p&gt;Automation addresses this by combining continuous sensitive data discovery with governed analytics, built on discovery metadata rather than document copies. This automated approach delivers four key advantages:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;No document movement.&lt;/strong&gt;&amp;nbsp;Your files stay in their system of record. Analytics runs against structured discovery metadata, not document content, so governance boundaries remain intact.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Continuous discovery instead of manual scanning.&lt;/strong&gt;&amp;nbsp;Automated classification identifies regulated and sensitive information on an ongoing basis, replacing periodic manual reviews with on-demand visibility.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Unified governance.&lt;/strong&gt;&amp;nbsp;You define matter-aligned access policies once, and they carry through from document storage to findings analytics and compliance reporting.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Built-in audit readiness.&lt;/strong&gt;&amp;nbsp;A durable record of discovery findings and remediation actions accumulates automatically over time, giving you structured evidence for client reviews and regulatory inquiries.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;&lt;strong&gt;Reference architecture&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;The following architecture shows how continuous discovery, governance, and compliance operations can work together without copying legal documents into analytics systems.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90724" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/30/BDB-5730-image-1.png" alt="This reference architecture illustrates how law firms and corporate legal departments can automate sensitive data discovery and compliance analytics on AWS without moving documents outside their system of record" width="919" height="642"&gt;&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Architecture walkthrough&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;&lt;strong&gt;Store and protect documents in Amazon Simple Storage Service (Amazon S3)&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Store your legal documents in&amp;nbsp;&lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3&lt;/a&gt;, which serves as the system of record for document content. Align your buckets and prefixes to client and matter structures so that access controls map directly to matter boundaries. Where your retention or legal hold requirements demand it, apply&amp;nbsp;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-lock.html" target="_blank" rel="noopener noreferrer"&gt;S3 Object Lock&lt;/a&gt;&amp;nbsp;to enforce immutability. You can encrypt your data using&amp;nbsp;&lt;a href="https://aws.amazon.com/kms/" target="_blank" rel="noopener noreferrer"&gt;AWS Key Management Service (AWS KMS)&lt;/a&gt;, which gives you centralized control over encryption keys and policies.&lt;/p&gt; 
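&lt;p&gt;To make the idea of matter-aligned prefixes concrete, here is a small sketch that derives client and matter identifiers from an object key. The &lt;code&gt;clients/.../matters/...&lt;/code&gt; layout and the function name are purely hypothetical. Your own naming convention will differ.&lt;/p&gt; 

```python
from pathlib import PurePosixPath


def matter_tags_from_key(key):
    """Derive client and matter tags from an S3 object key.

    Assumes a hypothetical layout such as
    clients/acme/matters/m-1042/contracts/msa.pdf
    """
    parts = PurePosixPath(key).parts
    if len(parts) >= 5 and parts[0] == "clients" and parts[2] == "matters":
        return {"client": parts[1], "matter": parts[3]}
    raise ValueError(f"key does not follow the client/matter layout: {key}")


if __name__ == "__main__":
    print(matter_tags_from_key("clients/acme/matters/m-1042/contracts/msa.pdf"))
```

&lt;p&gt;Keeping this mapping deterministic means the same identifiers can later be reused as governance tags on the findings dataset.&lt;/p&gt; 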
&lt;p&gt;&lt;strong&gt;Discover and classify sensitive data with Amazon Macie&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;You will configure&amp;nbsp;&lt;a href="https://aws.amazon.com/macie/" target="_blank" rel="noopener noreferrer"&gt;Amazon Macie&lt;/a&gt;&amp;nbsp;to continuously analyze your document repositories. Macie identifies regulated information such as personally identifiable information (PII), financial data, and other sensitive content and produces structured findings that describe what Macie identified and where it exists. This provides ongoing visibility into data exposure without requiring document movement or manual scanning.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Catalog and govern findings with AWS Glue and AWS Lake Formation&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;You will use&amp;nbsp;&lt;a href="https://aws.amazon.com/glue/" target="_blank" rel="noopener noreferrer"&gt;AWS Glue&lt;/a&gt;&amp;nbsp;to catalog the findings dataset and maintain its schema so it stays query-ready. Apply&amp;nbsp;&lt;a href="https://aws.amazon.com/lake-formation/" target="_blank" rel="noopener noreferrer"&gt;AWS Lake Formation&lt;/a&gt;&amp;nbsp;tag-based policies to govern access, aligning tags to client, matter, and confidentiality tier. This approach enforces ethical walls and least-privilege access consistently across analytics and reporting activities.&lt;/p&gt; 
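&lt;p&gt;The effect of tag-based policies on ethical walls can be pictured with a simplified in-memory model. The following sketch isn't the Lake Formation API and doesn't reproduce its full evaluation logic. It only shows the matching idea: a team sees a findings table when every tag on the table is among the values granted to that team. The tag keys, values, and team names are hypothetical.&lt;/p&gt; 

```python
def can_access(principal_tags, resource_tags):
    """Return True when every tag on the resource is granted to the principal.

    principal_tags maps a tag key to the set of values the principal may see.
    resource_tags maps a tag key to the single value assigned to the resource.
    """
    return all(
        value in principal_tags.get(key, set())
        for key, value in resource_tags.items()
    )


# Hypothetical tags on a findings table and grants for two teams.
findings_table = {"client": "acme", "matter": "m-1042", "confidentiality": "restricted"}
team_a = {
    "client": {"acme"},
    "matter": {"m-1042"},
    "confidentiality": {"internal", "restricted"},
}
team_b = {
    "client": {"globex"},
    "matter": {"m-2210"},
    "confidentiality": {"internal"},
}

if __name__ == "__main__":
    print("team_a:", can_access(team_a, findings_table))
    print("team_b:", can_access(team_b, findings_table))
```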
&lt;p&gt;&lt;strong&gt;AI-powered chat agent using Amazon Quick Suite&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;You can create custom chat agents to tailor conversational interfaces for specific legal business needs. These agents can be configured with legal-specific knowledge bases, connected to relevant document repositories, and customized with instructions appropriate for legal workflows. You can use this chat agent to interact with your legal documents through natural language conversation for capabilities like:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;b&gt;E-Discovery:&lt;/b&gt; Search and analyze large volumes of legal documents to quickly find relevant information across your document repository.&lt;/li&gt; 
 &lt;li&gt;&lt;b&gt;Contract Analysis:&lt;/b&gt; Review contracts and automatically extract key terms, clauses, and obligations to streamline your contract review process.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The chat agent can help you navigate complex document sets through conversational queries, making legal research and document review more efficient and accessible.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Analyze and report with Amazon Quick Sight&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;You will use &lt;a href="https://aws.amazon.com/quick/" target="_blank" rel="noopener noreferrer"&gt;Amazon Quick&lt;/a&gt; as your compliance operations workspace. Quick provides a unified environment where your teams can query findings, generate dashboards, track remediation actions, and produce audit-ready reports. The agentic AI capabilities of Amazon Quick can autonomously build analyses, surface anomalies across matters, generate executive summaries for client reviews, and proactively recommend remediation priorities based on finding severity and trends. Combined with built-in data stories for automated narrative generation and pixel-perfect paginated reports for regulatory submissions, Quick reduces the time from discovery to action while keeping your teams within a governed interface aligned to matter-based permissions. Rather than switching between separate visualization, workflow, and reporting tools, your legal and compliance teams can review findings, manage response activities, and collaborate all within a single workspace that respects ethical walls and privilege boundaries.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Escalate high-severity findings&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;For high-severity findings that demand immediate attention, route alerts through&amp;nbsp;&lt;a href="https://aws.amazon.com/security-hub/" target="_blank" rel="noopener noreferrer"&gt;AWS Security Hub&lt;/a&gt;&amp;nbsp;or&amp;nbsp;&lt;a href="https://aws.amazon.com/sns/" target="_blank" rel="noopener noreferrer"&gt;Amazon Simple Notification Service (Amazon SNS)&lt;/a&gt;&amp;nbsp;to trigger escalation workflows. This connects visibility directly to action when your teams identify sensitive data risks.&lt;/p&gt; 
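&lt;p&gt;One way to picture the escalation step is a filter that turns high-severity findings into notification bodies. The sketch below works on a simplified finding shape whose field names are assumptions loosely modeled on Macie findings. The &lt;code&gt;min_score&lt;/code&gt; threshold and the output keys are hypothetical, and publishing to a topic is deliberately left out.&lt;/p&gt; 

```python
import json


def escalation_messages(findings, min_score=3):
    """Return one JSON notification body per finding at or above min_score."""
    messages = []
    for finding in findings:
        severity = finding.get("severity", {})
        if severity.get("score", 0) >= min_score:
            resources = finding.get("resourcesAffected", {})
            messages.append(json.dumps(
                {
                    "findingId": finding.get("id"),
                    "type": finding.get("type"),
                    "severity": severity.get("description"),
                    "bucket": resources.get("s3Bucket", {}).get("name"),
                    "key": resources.get("s3Object", {}).get("key"),
                },
                sort_keys=True,
            ))
    return messages


if __name__ == "__main__":
    # Hypothetical sample findings.
    sample = [
        {
            "id": "f-001",
            "type": "SensitiveData:S3Object/Personal",
            "severity": {"score": 3, "description": "High"},
            "resourcesAffected": {
                "s3Bucket": {"name": "legal-docs"},
                "s3Object": {"key": "clients/acme/matters/m-1042/msa.pdf"},
            },
        },
        {"id": "f-002", "severity": {"score": 1, "description": "Low"}},
    ]
    for body in escalation_messages(sample):
        print(body)
```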
&lt;p&gt;&lt;strong&gt;Why this approach works for legal&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Documents stay where they belong.&lt;/strong&gt;&amp;nbsp;Your files remain in Amazon S3, aligned to client and matter boundaries. No content moves into separate analytics pipelines.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Ethical walls remain intact.&lt;/strong&gt;&amp;nbsp;Because analytics is built on discovery findings and not document copies, you can govern access to findings using the same matter-aligned controls that apply to documents. Compliance and security teams gain visibility without expanding document access.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Discovery runs continuously, not periodically.&lt;/strong&gt;&amp;nbsp;Rather than scheduling quarterly or annual scans, you maintain a current view of sensitive data across your repositories.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Governance applies once and carries through.&lt;/strong&gt;&amp;nbsp;Lake Formation tag-based policies govern findings access at the catalog level. You define your matter and confidentiality mappings once, and they carry through to every dashboard, query, and report.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Audit readiness is built in.&lt;/strong&gt;&amp;nbsp;Instead of assembling reports manually before a client review or regulatory inquiry, you maintain a historical record of discovery findings and remediation actions. You can demonstrate your posture over time with consistent, structured evidence.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Security and analytics reinforce each other.&lt;/strong&gt;&amp;nbsp;Your analytics capability is built on top of your security controls, not alongside them. Strengthening one strengthens the other.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;&lt;strong&gt;Cost considerations&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;The primary cost drivers for this architecture include:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon Macie:&lt;/strong&gt;&amp;nbsp;You pay based on the number of S3 buckets evaluated and the volume of data inspected for sensitive data discovery. Review&amp;nbsp;&lt;a href="https://aws.amazon.com/macie/pricing/" target="_blank" rel="noopener noreferrer"&gt;Amazon Macie pricing&lt;/a&gt;&amp;nbsp;for current rates.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon S3:&lt;/strong&gt;&amp;nbsp;Storage costs for both your document repositories and the compliance intelligence bucket. Consider S3 lifecycle policies to tier older findings into lower-cost storage classes.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;AWS Glue and AWS Lake Formation:&lt;/strong&gt;&amp;nbsp;Charges for crawlers and catalog storage. For most implementations, these costs are modest.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon Quick Sight:&lt;/strong&gt;&amp;nbsp;Per-user pricing based on the edition that you select (Standard or Enterprise). Enterprise edition supports row-level and column-level security, which aligns well with matter-based governance.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon EventBridge, AWS Security Hub, and Amazon SNS:&lt;/strong&gt;&amp;nbsp;Charges based on event volume and notifications delivered. For findings-based workflows, these costs are generally low.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Use the&amp;nbsp;&lt;a href="https://calculator.aws/" target="_blank" rel="noopener noreferrer"&gt;AWS Pricing Calculator&lt;/a&gt;&amp;nbsp;to estimate costs based on your repository size, user count, and discovery frequency.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Getting started&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Start by identifying a representative set of document repositories in Amazon S3. We recommend that you start with two or three matters that span different practice areas and confidentiality tiers.&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Turn on Amazon Macie&lt;/strong&gt;&amp;nbsp;for those repositories and configure automated sensitive data discovery.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Catalog the findings dataset&lt;/strong&gt;&amp;nbsp;with AWS Glue and apply Lake Formation tag-based access policies aligned to your matter structure.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Build your first Amazon Quick Sight dashboard&lt;/strong&gt;&amp;nbsp;to visualize findings by matter, sensitivity type, and severity.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Define escalation rules&lt;/strong&gt;&amp;nbsp;in AWS Security Hub or Amazon SNS for high-severity findings.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;After you validate this workflow against your initial repositories, expand gradually. Add more repositories to Macie discovery. Refine your governance tags to reflect practice areas and confidentiality tiers. Extend your dashboards from basic posture visibility to trend analysis and remediation tracking.&lt;/p&gt; 
&lt;p&gt;The goal isn’t to build a comprehensive analytics solution all at once. Start with a secure foundation where discovery findings, governance, and reporting operate together in a way that aligns with your legal workflows, and then expand from there.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;You don’t have to choose between protecting client data and understanding it. By building analytics on top of governed discovery findings and using a unified compliance workspace, you gain visibility into your data posture without weakening confidentiality boundaries.&lt;/p&gt; 
&lt;p&gt;This approach brings security, governance, and analytics together in a way that reflects how legal work is actually structured. It provides continuous visibility, supports audit readiness, and delivers insight without requiring documents to move outside their system of record.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Next steps&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Review the&amp;nbsp;&lt;a href="https://docs.aws.amazon.com/macie/latest/user/what-is-macie.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Macie User Guide&lt;/a&gt;&amp;nbsp;to understand sensitive data discovery configuration options and &lt;a href="https://docs.aws.amazon.com/quicksight/" target="_blank" rel="noopener noreferrer"&gt;Amazon Quick Sight documentation&lt;/a&gt;&amp;nbsp;to evaluate dashboard and row-level security capabilities.&lt;/p&gt; 
&lt;p&gt;Contact your&amp;nbsp;&lt;a href="https://aws.amazon.com/contact-us/" target="_blank" rel="noopener noreferrer"&gt;AWS account team&lt;/a&gt;&amp;nbsp;to discuss implementation support for legal and compliance workloads.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90766" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/30/BDB-5730-image-2-1-e1777588892859.png" alt="Photo of Author - Rohan Kamat" width="100" height="116"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;&lt;strong&gt;Rohan Kamat&lt;/strong&gt;&lt;/h3&gt; 
  &lt;p&gt;Rohan Kamat is a Solutions Architecture Leader within HCLS with extensive experience in cloud architecture, cybersecurity, Identity and Access Management, and enterprise networking. Rohan focuses on helping architects build both depth in cloud technologies and strength in executive communication, making sure they can confidently guide organizations through business and technical transformation. Outside of his professional work, Rohan enjoys time with his family, organizing community cricket events, and exploring fitness and wellness activities like pickleball and ping pong. He also enjoys planning travel experiences that bring people together and create lasting shared memories.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90726" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/30/BDB-5730-image-3.jpeg" alt="Photo of Author - Miguel Lopez Luis" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;&lt;strong&gt;Miguel Lopez Luis&lt;/strong&gt;&lt;/h3&gt; 
  &lt;p&gt;Miguel Lopez Luis is an AWS Solutions Architect who works with small and medium businesses across the United States. He graduated with a Bachelor’s degree in Cybersecurity from Bellevue University in Nebraska and is a member of the Omega Nu Lambda Honor Society. Leveraging his extensive expertise in business management, Miguel is passionate about planning strategic initiatives, leading cross-functional teams, and mentoring others. In his personal time, he enjoys activities that involve travel, sports, and cooking.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90727" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/30/BDB-5730-image-4.jpeg" alt="Photo of Author - Pranali Khose" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;&lt;strong&gt;Pranali Khose&lt;/strong&gt;&lt;/h3&gt; 
  &lt;p&gt;Pranali Khose is an AWS Solutions Architect based in Seattle. She works directly with small and medium business (SMB) customers across the United States to design and implement cloud solutions that address their unique business challenges and accelerate digital transformation. Pranali holds a Master of Science in Computer Science from the University of Texas at Arlington.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Streamlined monitoring and debugging for Amazon EMR on EC2</title>
		<link>https://aws.amazon.com/blogs/big-data/streamlined-monitoring-and-debugging-for-amazon-emr-on-ec2/</link>
					
		
		<dc:creator><![CDATA[Parul Saxena]]></dc:creator>
		<pubDate>Tue, 12 May 2026 15:59:22 +0000</pubDate>
				<category><![CDATA[Amazon EMR]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[AWS Big Data]]></category>
		<category><![CDATA[Best Practices]]></category>
		<category><![CDATA[Monitoring and observability]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">b8dc238b255037fa7d65ca154c626c0e662d22ab</guid>

					<description>In this post, we walk you through five key enhancements: Amazon CloudWatch Logs integration, step-level Amazon Simple Storage Service (Amazon S3) logging controls, expanded console UIs for YARN and Tez, Amazon EMR step to YARN application ID mapping, and enhanced custom metrics with updated documentation.</description>
										<content:encoded>&lt;p&gt;As organizations scale their data processing and analytics workloads on Amazon EMR on EC2, observability across cluster health, job execution, and resource usage becomes increasingly important. Teams often manage log collection across distributed nodes, correlate Amazon EMR steps with underlying YARN applications, and configure monitoring agents to capture the right level of detail for their environment.&lt;/p&gt; 
&lt;p&gt;With Amazon EMR release 7.11.0 and updates to the Amazon EMR console, Amazon EMR on EC2 introduces observability capabilities that streamline these workflows further. In this post, we walk you through five key enhancements: Amazon CloudWatch Logs integration, step-level Amazon Simple Storage Service (Amazon S3) logging controls, expanded console UIs for YARN and Tez, Amazon EMR step to YARN application ID mapping, and enhanced custom metrics with updated documentation.&lt;/p&gt; 
&lt;h2&gt;What’s new&lt;/h2&gt; 
&lt;p&gt;The following sections cover key improvements across the Amazon EMR console, logging, metrics collection, and documentation to give you deeper, end-to-end visibility into your Amazon EMR clusters and workloads.&lt;/p&gt; 
&lt;h3&gt;1. CloudWatch Logs integration&lt;/h3&gt; 
&lt;p&gt;Starting with Amazon EMR release 7.11.0, you can stream cluster logs to Amazon CloudWatch Logs in near real time without requiring custom bootstrap actions or manual agent configuration. With &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-logging-cw.html" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch logging enabled&lt;/a&gt;, Amazon EMR automatically captures and streams Amazon EMR step execution logs, Spark driver logs, and Spark executor logs as they’re generated. This makes them immediately available for monitoring, troubleshooting, and post-mortem analysis through the CloudWatch console or API.&lt;/p&gt; 
&lt;p&gt;You can enable CloudWatch logging through the Amazon EMR console during cluster creation, or programmatically using the AWS Command Line Interface (AWS CLI) or an AWS SDK. In both cases, include the Amazon CloudWatch Agent in your application configuration and specify your logging preferences in the monitoring configuration.&lt;/p&gt; 
&lt;p&gt;With minimal configuration, Amazon EMR captures step logs and Spark driver logs by default, streaming them to a log group named &lt;code&gt;/aws/emr/{cluster_id}&lt;/code&gt;. For production workloads requiring stricter organizational and security controls, you can customize the log group name, define a log stream prefix for streamlined filtering, enable encryption with an AWS Key Management Service (AWS KMS) key, and explicitly select which log types to capture. The following example demonstrates a fully customized configuration:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;aws emr create-cluster \
  --name "EMR cluster with custom CloudWatch Logs" \
  --release-label emr-7.11.0 \
  --applications Name=Spark Name=AmazonCloudWatchAgent \
  --instance-type m7g.2xlarge \
  --instance-count 3 \
  --use-default-roles \
  --monitoring-configuration '{
    "CloudWatchLogConfiguration": {
      "Enabled": true,
      "LogGroupName": "/my-company/emr/production",
      "LogStreamNamePrefix": "cluster-prod",
      "EncryptionKeyArn": "arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012",
      "LogTypes": {
        "STEP_LOGS": ["STDOUT", "STDERR"],
        "SPARK_DRIVER": ["STDOUT", "STDERR"],
        "SPARK_EXECUTOR": ["STDERR", "STDOUT"]
      }
    }
  }'&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;This configuration directs the logs to a custom log group (&lt;code&gt;/my-company/emr/production&lt;/code&gt;), prefixes log stream names with &lt;code&gt;cluster-prod&lt;/code&gt; for consistent identification across clusters, encrypts log data at rest using the specified AWS KMS key, and captures the full set of available log types: step, Spark driver, and Spark executor stdout and stderr. Because logs are streamed to CloudWatch as they’re written, you have near real-time visibility into job execution without waiting for log aggregation to Amazon S3 or establishing direct connectivity to cluster nodes. With CloudWatch Logs Insights, you can run structured queries across log streams, making it straightforward to trace failures, correlate errors across driver and executor logs, and build metric filters or alarms based on specific log patterns.&lt;/p&gt; 
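&lt;p&gt;Hand-written JSON inside a shell argument is easy to get wrong, for example by dropping a brace or a quote. As a minimal sketch, the following Python snippet builds the same monitoring configuration as a dictionary and serializes it into a shell-safe argument. The helper name &lt;code&gt;build_monitoring_configuration&lt;/code&gt; is hypothetical; the keys and values are copied from the example in this section, and the snippet only prints the argument without calling any AWS API.&lt;/p&gt; 

```python
import json
import shlex


def build_monitoring_configuration(log_group, stream_prefix, kms_key_arn, log_types):
    """Return the monitoring configuration as a dict (hypothetical helper).

    The key names mirror the create-cluster example in this section.
    """
    return {
        "CloudWatchLogConfiguration": {
            "Enabled": True,
            "LogGroupName": log_group,
            "LogStreamNamePrefix": stream_prefix,
            "EncryptionKeyArn": kms_key_arn,
            "LogTypes": log_types,
        }
    }


config = build_monitoring_configuration(
    log_group="/my-company/emr/production",
    stream_prefix="cluster-prod",
    kms_key_arn="arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012",
    log_types={
        "STEP_LOGS": ["STDOUT", "STDERR"],
        "SPARK_DRIVER": ["STDOUT", "STDERR"],
        "SPARK_EXECUTOR": ["STDERR", "STDOUT"],
    },
)

# json.dumps guarantees balanced braces and valid quoting; shlex.quote wraps
# the result so it can be pasted after --monitoring-configuration in a POSIX shell.
cli_argument = shlex.quote(json.dumps(config))
print(cli_argument)
```

&lt;p&gt;Generating the argument this way keeps the configuration in one reviewable structure, which can help when the same settings are reused across several cluster definitions.&lt;/p&gt; 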
&lt;h3&gt;2. Step-level S3 logging improvements&lt;/h3&gt; 
&lt;p&gt;S3 logging capabilities now provide &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging-step-log-customization.html" target="_blank" rel="noopener noreferrer"&gt;granular control over how step logs are organized and secured&lt;/a&gt;. You can now specify a dedicated S3 log destination and AWS KMS encryption key at the individual Amazon EMR step level. This allows different steps within the same cluster to write logs to separate S3 paths with independent encryption configurations. This is particularly useful for multi-tenant clusters or workflows with varying data classification requirements.&lt;/p&gt; 
&lt;p&gt;Step-level logging is configured through the &lt;code&gt;StepMonitoringConfiguration&lt;/code&gt; parameter, which accepts an &lt;code&gt;S3MonitoringConfiguration&lt;/code&gt; object where you can define the target S3 path and an AWS KMS key for encryption at rest:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;"StepMonitoringConfiguration": {
  "S3MonitoringConfiguration": {
    "LogUri": "s3://your-s3-bucket/",
    "EncryptionKeyArn": "arn:aws:kms:your-kms-key-arn"
  }
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;This configuration is optional. When omitted, the step inherits the &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-plan-debugging.html" target="_blank" rel="noopener noreferrer"&gt;default S3 log path and encryption settings defined at the cluster level during creation&lt;/a&gt;. With this configuration, you can override logging behavior only for the steps that require it, while maintaining a consistent default for the rest of your workflow.&lt;/p&gt; 
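&lt;p&gt;The inheritance rule can be illustrated with a short sketch. The function &lt;code&gt;resolve_step_logging&lt;/code&gt; and the bucket names are hypothetical, and the snippet models only the behavior described here: a step without &lt;code&gt;StepMonitoringConfiguration&lt;/code&gt; uses the cluster defaults. It assumes, for simplicity, that a step-level &lt;code&gt;S3MonitoringConfiguration&lt;/code&gt; replaces the cluster settings as a whole; how the service treats a partially specified override is not covered in this post.&lt;/p&gt; 

```python
def resolve_step_logging(cluster_defaults, step_config=None):
    """Return the S3 logging settings a step would effectively use.

    Assumption: a step-level S3MonitoringConfiguration replaces the cluster
    defaults as a whole; partial overrides are not modeled here.
    """
    monitoring = (step_config or {}).get("StepMonitoringConfiguration", {})
    s3_override = monitoring.get("S3MonitoringConfiguration")
    if s3_override is None:
        # No step-level configuration: inherit the cluster-level settings.
        return dict(cluster_defaults)
    return dict(s3_override)


# Hypothetical cluster-level defaults and one step-level override.
cluster_defaults = {
    "LogUri": "s3://example-cluster-logs/",
    "EncryptionKeyArn": "arn:aws:kms:your-cluster-kms-key-arn",
}
sensitive_step = {
    "StepMonitoringConfiguration": {
        "S3MonitoringConfiguration": {
            "LogUri": "s3://example-restricted-logs/",
            "EncryptionKeyArn": "arn:aws:kms:your-restricted-kms-key-arn",
        }
    }
}

print(resolve_step_logging(cluster_defaults))                  # inherits defaults
print(resolve_step_logging(cluster_defaults, sensitive_step))  # uses the override
```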
&lt;h3&gt;3. Enhanced console with direct access to monitoring UIs&lt;/h3&gt; 
&lt;p&gt;Additional live application UIs are accessible directly from the Amazon EMR Console. These console-hosted interfaces remove the need to configure SSH (Secure Shell) tunnels, set up proxies, or establish any direct network connectivity to cluster nodes to reach application web UIs. The newly added interfaces include:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;YARN ResourceManager UI –&lt;/strong&gt; Monitor cluster-wide resource allocation, queue usage, and application lifecycle states across running and completed YARN applications. This interface also provides direct access to container-level logs for running YARN applications, enabling real-time debugging without requiring node-level access.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Tez UI –&lt;/strong&gt; Inspect Hive query execution plans, DAG visualizations, vertex-level performance metrics, and task-level counters for queries executed through the Tez execution engine (for example, Hive and Pig workloads).&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;These join the existing Spark History Server and YARN timeline interfaces already available through the console. By surfacing these UIs, administrators can give developers and analysts visibility into cluster workloads and application diagnostics without exposing direct network access to cluster infrastructure. This maintains tighter security boundaries while preserving full observability into job execution and resource consumption.&lt;/p&gt; 
&lt;p&gt;With these additions, Amazon EMR now offers three complementary approaches to accessing application web interfaces, each suited to different operational requirements. Live Application UIs provide console-hosted access to web interfaces on running clusters. They’re recommended for environments where direct network connectivity to cluster nodes must be restricted from end users. &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-web-interfaces.html" target="_blank" rel="noopener noreferrer"&gt;On-Cluster Web UIs&lt;/a&gt; offer full, unrestricted access to the complete set of native application web interfaces running on cluster nodes, suited for administrators and engineers who require deep, low-level visibility. &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-cluster-application-history.html" target="_blank" rel="noopener noreferrer"&gt;Persistent Web UIs&lt;/a&gt; retain application-level data beyond cluster lifetime, so you can analyze and troubleshoot workloads on terminated clusters. Together, these options give you the flexibility to balance security boundaries, access scope, and data retention based on your team’s specific monitoring and debugging workflows.&lt;/p&gt; 
&lt;h3&gt;4. EMR step to YARN application ID mapping&lt;/h3&gt; 
&lt;p&gt;The Amazon EMR console now surfaces the &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/debug-emr-yarn.html" target="_blank" rel="noopener noreferrer"&gt;YARN Application ID directly within the EMR step&lt;/a&gt; details panel. For each step executing a Spark, Hive, or other YARN-based workload, the console displays the submitted YARN Application ID associated with that step, establishing a direct link between the EMR step abstraction and the underlying YARN application. With this mapping, you can:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Directly correlate EMR steps to YARN applications – &lt;/strong&gt;when a step fails or exhibits unexpected behavior, you can immediately identify the exact YARN application to investigate rather than manually cross-referencing timestamps or job names across interfaces.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Access live monitoring tools –&lt;/strong&gt; with the YARN application ID readily available, you can navigate directly to the YARN ResourceManager Live UI or the Spark History Server to inspect resource consumption, task-level execution details, and application state for both running and completed jobs.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Retrieve logs for detailed troubleshooting – &lt;/strong&gt;the application ID serves as the key lookup for retrieving container-level logs persisted to Amazon S3, significantly reducing the time to root-cause failures or diagnose performance regressions.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;To use this feature, open the &lt;strong&gt;Steps&lt;/strong&gt; tab on your Amazon EMR cluster detail page and select the step that you want to investigate. The YARN Application ID appears in the step details panel. From there, you can use the ID to navigate to the YARN ResourceManager Live UI at &lt;code&gt;http://resourcemanager-host:8088/cluster/app/&amp;lt;application_id&amp;gt;&lt;/code&gt;, open the corresponding view in the Spark History Server, or locate the associated container logs in your configured S3 log destination.&lt;/p&gt; 
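&lt;p&gt;If you copy application IDs from the step details panel into scripts or runbooks, a small helper can build the ResourceManager link for you. The following is a minimal sketch: the function name &lt;code&gt;resource_manager_app_url&lt;/code&gt; and the sample host and ID are hypothetical, the URL layout follows the pattern shown in this section, and the check assumes the usual YARN ID format of &lt;code&gt;application_&lt;/code&gt; followed by a cluster timestamp and a sequence number.&lt;/p&gt; 

```python
import re

# YARN application IDs look like application_1715526000000_0042.
APPLICATION_ID_PATTERN = re.compile(r"^application_\d+_\d+$")


def resource_manager_app_url(host, application_id, port=8088):
    """Build the ResourceManager UI link for one YARN application."""
    if not APPLICATION_ID_PATTERN.match(application_id):
        raise ValueError(
            f"unexpected YARN application ID: {application_id!r} "
            "(expected application_TIMESTAMP_SEQUENCE)"
        )
    return f"http://{host}:{port}/cluster/app/{application_id}"


# Hypothetical host name and application ID, for illustration only.
url = resource_manager_app_url("resourcemanager-host", "application_1715526000000_0042")
print(url)
```

&lt;p&gt;Validating the ID before building the link catches common copy-and-paste mistakes, such as pasting an Amazon EMR step ID instead of the YARN application ID.&lt;/p&gt; 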
&lt;h3&gt;5. Enhanced custom metrics and observability documentation&lt;/h3&gt; 
&lt;p&gt;By default, Amazon EMR automatically sends cluster-level metrics to &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/UsingEMR_ViewingMetrics.html" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch at five-minute intervals&lt;/a&gt;, covering YARN application states, node health, HDFS utilization, and I/O activity. With Amazon EMR Release 7.0 and later, enabling the &lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-AmazonCloudWatchAgent.html" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch Agent&lt;/a&gt; extends this baseline with additional detailed metrics collected at one-minute intervals across cluster nodes. Furthermore, Amazon EMR 7.1 introduced custom metric classifications that you can use to define precisely which component-level metrics to collect from Hadoop, YARN, and HBase subsystems, like DataNode I/O activity, NodeManager JVM heap utilization, container resource consumption, and HBase performance counters. Each classification supports configurable export intervals, giving you control over collection granularity based on your monitoring requirements.&lt;/p&gt; 
&lt;p&gt;Once enabled, custom metrics are accessible directly from the &lt;strong&gt;Monitoring&lt;/strong&gt; tab in the Amazon EMR console, where you can use a classification filter to switch between the HDFS, YARN, and HBase custom metric groupings that you’ve defined. Metric configurations can also be updated on running clusters through the console’s reconfiguration workflow, so you can adapt your monitoring strategy as workload requirements evolve without cluster downtime. For environments using Prometheus, metrics can also be forwarded to Amazon Managed Service for Prometheus and visualized through Grafana dashboards.&lt;/p&gt; 
&lt;p&gt;The following documentation and tutorials are available to help you get the most out of these capabilities:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/enhanced-custom-metrics.html" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;Enhanced Custom Metrics Guide&lt;/strong&gt;&lt;/a&gt; provides step-by-step instructions for configuring CloudWatch Agent to publish custom metrics.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-metrics-observability.html" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;EMR Observability Best Practices&lt;/strong&gt;&lt;/a&gt; provides a comprehensive guide covering monitoring strategies, metric selection, and troubleshooting workflows.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/enhanced-custom-metrics-application-status.html" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;Service Status Monitoring&lt;/strong&gt;&lt;/a&gt; provides a tutorial on monitoring and publishing Amazon EMR application status.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/enhanced-custom-metrics-applications.html" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;Monitor Apache Spark applications on Amazon EMR with Amazon CloudWatch&lt;/strong&gt;&lt;/a&gt; provides a tutorial on publishing detailed Spark metrics to CloudWatch and identifying performance bottlenecks in Spark applications.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Getting started&lt;/h2&gt; 
&lt;p&gt;These observability improvements are available now for Amazon EMR on EC2. To get started:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;For CloudWatch Logs integration and step-level log configuration&lt;/strong&gt;: Launch a new cluster with Amazon EMR release 7.11.0 or later.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;For console enhancements&lt;/strong&gt;: Navigate to your existing Amazon EMR clusters in the AWS Console to access Live Application UI links and YARN Application ID mappings in step details, with no additional configuration required.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;For custom metrics&lt;/strong&gt;: Review our &lt;a href="http://docs.aws.amazon.com/emr/latest/ManagementGuide/enhanced-custom-metrics.html" target="_blank" rel="noopener noreferrer"&gt;Enhanced Custom Metrics documentation&lt;/a&gt; to configure the CloudWatch Agent for publishing Hadoop, YARN, and HBase component metrics using custom classification files.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;With these enhancements, Amazon EMR on EC2 provides deeper visibility into cluster health, job execution, and resource usage, helping you reduce time to root cause and focus on delivering value from your data. Note that enabling CloudWatch Logs integration and custom metrics incurs additional CloudWatch charges based on log ingestion volume and metric publishing frequency.&lt;/p&gt; 
&lt;p&gt;If you have feedback or questions, reach out to your AWS account team or post on &lt;a href="https://repost.aws/" target="_blank" rel="noopener noreferrer"&gt;AWS re:Post&lt;/a&gt;.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-91037 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/06/parul.jpg" alt="" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Parul Saxena&lt;/strong&gt;&lt;br&gt; &lt;a href="https://www.linkedin.com/in/parulsaxena27/" target="_blank" rel="noopener noreferrer"&gt;Parul&lt;/a&gt; is a Senior Big Data Specialist Solutions Architect at Amazon Web Services (AWS). She helps customers and partners build highly optimized, scalable, and secure solutions. She specializes in Amazon EMR, Amazon Athena, and AWS Lake Formation, providing architectural guidance for complex big data workloads and assisting organizations in modernizing their architectures and migrating analytics workloads to AWS.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-72477 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2024/11/19/ravi-kumar.png" alt="" width="100" height="130"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Ravi Kumar Singh&lt;/strong&gt;&lt;br&gt; &lt;a href="https://www.linkedin.com/in/ravikumarsingh19/" target="_blank" rel="noopener noreferrer"&gt;Ravi Kumar Singh&lt;/a&gt; is a Senior Product Manager Technical-ES (PMT) at Amazon Web Services, specializing in exabyte-scale data infrastructure and analytics platforms. He helps customers unlock insights from their data using open-source technologies and cloud computing for AI/ML use cases. Outside of work, Ravi enjoys exploring emerging trends in data science and machine learning.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-33305 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2022/08/16/lorenzo-ripani.png" alt="" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Lorenzo Ripani&lt;/strong&gt;&lt;br&gt; &lt;a href="https://www.linkedin.com/in/ripani" target="_blank" rel="noopener noreferrer"&gt;Lorenzo Ripani&lt;/a&gt; is a Big Data Solution Architect at AWS. He is passionate about distributed systems, open-source technologies, and security. He spends most of his time working with customers around the world to design, evaluate and optimize scalable and secure data pipelines with Amazon EMR.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-91038 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/06/Arun.jpg" alt="" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Arun Prabakaran&lt;/strong&gt;&lt;br&gt; &lt;a href="https://www.linkedin.com/in/arprab/" target="_blank" rel="noopener noreferrer"&gt;Arun Prabakaran&lt;/a&gt; is a Senior Software Engineer working at AWS. His expertise spans distributed data processing and large-scale systems. He is passionate about building reliable data platforms and enabling organizations to run analytics and AI workloads at scale.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-91039 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/06/Jason.jpg" alt="" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Jason Zou&lt;/strong&gt;&lt;br&gt; &lt;a href="https://www.linkedin.com/in/jasonpzou/" target="_blank" rel="noopener noreferrer"&gt;Jason Zou&lt;/a&gt; is a Software Development Engineer at Amazon Web Services, where he works on internal infrastructure supporting EMR clusters. He is passionate about building scalable, fault-tolerant distributed systems. Outside of work, he enjoys photography and playing basketball.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-91040 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/06/Justin.jpg" alt="" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Justin Mae&lt;/strong&gt;&lt;br&gt; Justin Mae is a Software Development Engineer on the Amazon EMR team at Amazon Web Services. He works on EMR on EC2’s control plane, building systems that improve cluster performance, observability, and operational reliability.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Detect and resolve HBase inconsistencies faster with AI on Amazon EMR</title>
		<link>https://aws.amazon.com/blogs/big-data/detect-and-resolve-hbase-inconsistencies-faster-with-ai-on-amazon-emr/</link>
					
		
		<dc:creator><![CDATA[Yu-Ting Su]]></dc:creator>
		<pubDate>Tue, 12 May 2026 15:56:41 +0000</pubDate>
				<category><![CDATA[Amazon EMR]]></category>
		<category><![CDATA[Amazon OpenSearch Service]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Customer Solutions]]></category>
		<category><![CDATA[Experience-Based Acceleration]]></category>
		<category><![CDATA[Kiro]]></category>
		<category><![CDATA[Amazon OpenSearch]]></category>
		<category><![CDATA[Apache HBase]]></category>
		<category><![CDATA[EMR]]></category>
		<guid isPermaLink="false">db9396e58fef011dcd89131f3ab418a1d8dd68f3</guid>

					<description>In this post, we show you how to build an AI-powered troubleshooting solution using Amazon OpenSearch Service vector search and intelligent analysis. This solution reduces HBase inconsistency resolution from hours to minutes and root cause identification from days to hours through natural language queries over operational data. It makes HBase troubleshooting accessible across teams and reduces dependency on specialized expertise.</description>
										<content:encoded>&lt;p&gt;&lt;a href="https://hbase.apache.org/book.html" target="_blank" rel="noopener"&gt;HBase&lt;/a&gt; operations teams spend hours manually correlating logs, metadata, and consistency reports to identify root causes. Traditional approaches require deep expertise and extensive investigation across scattered data sources, directly impacting MTTR and operational efficiency. As HBase deployments scale and expertise becomes increasingly scarce, organizations face mounting pressure to maintain service reliability while managing growing operational complexity. The manual nature of troubleshooting creates bottlenecks that delay incident resolution, increase operational costs, and risk service degradation during critical business periods.&lt;/p&gt; 
&lt;p&gt;In this post, we show you how to build an AI-powered troubleshooting solution using &lt;a href="https://aws.amazon.com/opensearch-service/" target="_blank" rel="noopener"&gt;Amazon OpenSearch Service&lt;/a&gt; vector search and intelligent analysis. This solution reduces HBase inconsistency resolution from hours to minutes and root cause identification from days to hours through natural language queries over operational data. It makes HBase troubleshooting accessible across teams and reduces dependency on specialized expertise.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;The solution addresses HBase troubleshooting challenges through data processing, vector search, and AI-powered analysis. It processes operational data from &lt;a href="https://aws.amazon.com/emr/" target="_blank" rel="noopener noreferrer"&gt;Amazon EMR&lt;/a&gt; clusters, generates semantic vector embeddings, and enables natural language queries for intelligent troubleshooting.&lt;br&gt; Key components include:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon EMR HBase:&lt;/strong&gt; Runs HBase workloads with &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3&lt;/a&gt; as the HBase rootdir for durable, scalable storage&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Data Processing&lt;/strong&gt;: Extracts and processes HBase logs, &lt;a href="https://github.com/apache/hbase-operator-tools/blob/master/hbase-hbck2/README.md" target="_blank" rel="noopener noreferrer"&gt;HBCK&lt;/a&gt; reports, and metadata with vector embeddings&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon OpenSearch Service&lt;/strong&gt;: Provides vector search capabilities with k-NN algorithms for semantic analysis&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;AI Analysis Interface&lt;/strong&gt;: Enables natural language queries with context-aware recommendations&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Custom Knowledge Base&lt;/strong&gt;: Supports organization-specific runbooks and troubleshooting procedures by ingesting Git repositories via &lt;a href="https://kiro.dev/" target="_blank" rel="noopener noreferrer"&gt;Kiro CLI&lt;/a&gt;’s &lt;code&gt;/knowledge add&lt;/code&gt; command, enabling the AI assistant to reference custom operational guides alongside HBase source code and operational tools&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55491.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90179" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55491.png" alt="AWS cloud architecture diagram showing an HBase log analysis system with EMR cluster, VPC networking, IAM roles, Lambda functions, OpenSearch domain, and supporting services for scalable log processing and analytics." width="1000" height="573"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;The preceding diagram illustrates how the HBase log analysis system troubleshoots inconsistencies through automated workflows across AWS services.&lt;/p&gt; 
&lt;p&gt;When an operations team needs to investigate HBase issues, the engineer connects over SSH to the Amazon EMR primary node and runs the error collection script, which gathers logs from HBase master and RegionServer nodes and uploads them to Amazon S3. Next, the engineer connects to the Analytics &lt;a href="https://aws.amazon.com/ec2/" target="_blank" rel="noopener"&gt;Amazon Elastic Compute Cloud (Amazon EC2)&lt;/a&gt; instance and executes the automated processing script, which downloads logs from Amazon S3, generates semantic vector embeddings, and ingests them into Amazon OpenSearch Service for k-NN-based semantic search. The engineer then queries the Kiro CLI AI Assistant using natural language to investigate. Kiro searches Amazon OpenSearch Service for relevant log entries and uses &lt;a href="https://aws.amazon.com/bedrock/" target="_blank" rel="noopener"&gt;Amazon Bedrock&lt;/a&gt; to analyze patterns, correlate errors across components, and provide actionable recommendations. This reduces troubleshooting time from hours to minutes. The system operates within an &lt;a href="https://aws.amazon.com/vpc/" target="_blank" rel="noopener"&gt;Amazon Virtual Private Cloud (Amazon VPC)&lt;/a&gt; with private subnets for Amazon EMR and Analytics &lt;a href="https://aws.amazon.com/ec2/" target="_blank" rel="noopener"&gt;Amazon EC2&lt;/a&gt;, &lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener"&gt;AWS Identity and Access Management (AWS IAM)&lt;/a&gt; roles for access control, Parameter Store for configuration, and &lt;a href="https://aws.amazon.com/cloudwatch/" target="_blank" rel="noopener"&gt;Amazon CloudWatch&lt;/a&gt; for monitoring.&lt;/p&gt; 
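&lt;p&gt;To make the k-NN step more concrete, the following sketch builds the two request bodies that such a workflow typically sends to OpenSearch: an index definition with a &lt;code&gt;knn_vector&lt;/code&gt; field, and a k-NN query for the nearest log entries. It is not the processing script from this solution. The field names &lt;code&gt;message&lt;/code&gt; and &lt;code&gt;embedding&lt;/code&gt;, the helper names, and the tiny vector dimension are assumptions chosen for illustration, and the snippet only prints JSON without connecting to a domain.&lt;/p&gt; 

```python
import json


def knn_index_body(dimension):
    """Index definition with k-NN enabled and one vector field (assumed names)."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "message": {"type": "text"},
                "embedding": {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


def knn_query_body(query_vector, k=5):
    """k-NN query returning the k log entries closest to query_vector."""
    return {
        "size": k,
        "query": {"knn": {"embedding": {"vector": list(query_vector), "k": k}}},
    }


# A 4-dimensional vector keeps the example readable; real embedding models
# produce vectors with hundreds of dimensions.
index_body = knn_index_body(dimension=4)
query_body = knn_query_body([0.1, 0.2, 0.3, 0.4], k=3)
print(json.dumps(index_body, indent=2))
print(json.dumps(query_body, indent=2))
```

&lt;p&gt;The dimension in the index definition must match the length of the vectors produced by the embedding model, so it is worth deriving both from the same setting.&lt;/p&gt; 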
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;For this walkthrough, you need the following prerequisites:&lt;/p&gt; 
&lt;h3&gt;AWS account setup&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;An &lt;a href="https://signin.aws.amazon.com/signin?redirect_uri=https%3A%2F%2Fportal.aws.amazon.com%2Fbilling%2Fsignup%2Fresume&amp;amp;client_id=signup" target="_blank" rel="noopener noreferrer"&gt;AWS account&lt;/a&gt; with administrative access for initial deployment&lt;/li&gt; 
 &lt;li&gt;AWS Command Line Interface (AWS CLI) configured with administrative credentials&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Required AWS IAM permissions&lt;/h3&gt; 
&lt;h4&gt;For infrastructure deployment&lt;/h4&gt; 
&lt;p&gt;Your deployment user or role needs the following permissions:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Sufficient access to &lt;a href="https://aws.amazon.com/cloudformation/" target="_blank" rel="noopener noreferrer"&gt;AWS CloudFormation&lt;/a&gt;, Amazon S3, AWS IAM, and &lt;a href="https://aws.amazon.com/systems-manager/" target="_blank" rel="noopener noreferrer"&gt;AWS Systems Manager&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;The ability to create AWS CloudFormation stacks.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Infrastructure deployment:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;For infrastructure deployment, you need AWS CloudFormation stack management permissions.&lt;/li&gt; 
 &lt;li&gt;You also require sufficient access to create and manage the following resources: 
  &lt;ul&gt; 
   &lt;li&gt;Amazon OpenSearch Service domains&lt;/li&gt; 
   &lt;li&gt;Amazon EC2 instances, Amazon VPCs, security groups, and networking components&lt;/li&gt; 
   &lt;li&gt;AWS IAM roles and policies&lt;/li&gt; 
   &lt;li&gt;AWS Systems Manager Parameter Store entries&lt;/li&gt; 
   &lt;li&gt;Amazon CloudWatch Logs groups&lt;/li&gt; 
   &lt;li&gt;Amazon S3 bucket for access logs and session logs&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ul&gt; 
&lt;h4&gt;Runtime service roles&lt;/h4&gt; 
&lt;p&gt;The AWS CloudFormation stack automatically creates two specialized AWS IAM roles designed with least-privilege access principles.&lt;/p&gt; 
&lt;p&gt;The first role is the Amazon OpenSearch Service Role, which manages Amazon VPC networking and Amazon CloudWatch logging for the Amazon OpenSearch Service domain.&lt;/p&gt; 
&lt;p&gt;The second role is the Application Role, which provides minimal Amazon OpenSearch Service and Amazon S3 access specifically for log processing applications and secure log ingestion operations.&lt;/p&gt; 
&lt;h3&gt;Network requirements&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;Amazon VPC with private subnets for secure Amazon OpenSearch Service deployment&lt;/li&gt; 
 &lt;li&gt;NAT Gateway for outbound internet access from private subnets&lt;/li&gt; 
 &lt;li&gt;Security groups configured for HTTPS-only communication&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Running Kiro CLI on Amazon EC2&lt;/h3&gt; 
&lt;h4&gt;Kiro platform requirements:&lt;/h4&gt; 
&lt;p&gt;&lt;strong&gt;Kiro subscription&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Active Kiro License&lt;/strong&gt;: Valid subscription to Kiro platform&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;User Account&lt;/strong&gt;: Registered Kiro user account with appropriate permissions&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;API Access&lt;/strong&gt;: Kiro API keys or authentication tokens for CLI access&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;AWS IAM Identity Center integration&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/singlesignon/latest/userguide/what-is.html" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;AWS IAM Identity Center&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt; Setup&lt;/strong&gt;: AWS IAM Identity Center enabled in your &lt;a href="https://docs.aws.amazon.com/organizations/latest/userguide/orgs_introduction.html" target="_blank" rel="noopener noreferrer"&gt;AWS organization&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Permission Sets&lt;/strong&gt;: Configured permission sets for Kiro users with appropriate AWS access&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;User Assignment&lt;/strong&gt;: Users assigned to relevant AWS accounts and permission sets&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;SAML/OIDC Configuration&lt;/strong&gt;: Identity provider integration if using external identity systems&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Additional prerequisites&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;Python 3.7+ and Node.js installed locally&lt;/li&gt; 
 &lt;li&gt;Python 3.11+ for &lt;a href="https://aws.amazon.com/lambda/" target="_blank" rel="noopener noreferrer"&gt;AWS Lambda&lt;/a&gt; runtime environment (required for OpenSearch MCP server compatibility)&lt;/li&gt; 
 &lt;li&gt;Sufficient service quotas for Amazon OpenSearch Service instances and Amazon EC2 resources&lt;/li&gt; 
 &lt;li&gt;Access to the analysis instance through AWS Systems Manager Session Manager (recommended)&lt;/li&gt; 
 &lt;li&gt;Amazon EMR clusters running HBase workloads&lt;/li&gt; 
 &lt;li&gt;An Amazon EMR EC2 instance profile (&lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-role-for-ec2.html" target="_blank" rel="noopener noreferrer"&gt;EMR_EC2_DefaultRole&lt;/a&gt; by default) that is allowed to run &lt;code&gt;describe-stacks&lt;/code&gt; on AWS CloudFormation stacks in us-east-1&lt;/li&gt; 
 &lt;li&gt;Basic familiarity with HBase operations&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The deployment follows AWS security best practices with resource-specific permissions, regional restrictions, and encrypted data storage. All AWS IAM policies implement least-privilege access patterns to help secure operation of the log analysis pipeline.&lt;/p&gt; 
&lt;h2&gt;Walkthrough&lt;/h2&gt; 
&lt;p&gt;This walkthrough demonstrates deploying and configuring the AI-powered HBase troubleshooting solution in five key steps:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Deploy AWS infrastructure using AWS CloudFormation&lt;/li&gt; 
 &lt;li&gt;Connect to the Amazon EC2 instance and set up the system (optionally adding a custom knowledge base)&lt;/li&gt; 
 &lt;li&gt;Configure Amazon EMR log analysis collection&lt;/li&gt; 
 &lt;li&gt;Process and index HBase data&lt;/li&gt; 
 &lt;li&gt;Enable AI-powered analysis&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;The complete solution is available in our &lt;a href="https://github.com/aws-samples/sample-emr-hbase-inconsistencies-detection-recovery-mcp-kiro/tree/main" target="_blank" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt; 
&lt;h3&gt;Step 1: Deploy the infrastructure&lt;/h3&gt; 
&lt;p&gt;Deploy the required AWS infrastructure including Amazon OpenSearch Service domain, Amazon EC2 instances, and AWS IAM roles.&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;To deploy the infrastructure&lt;/em&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Deploy the AWS CloudFormation stack. Replace &lt;a href="mailto:your-email@example.com" target="_blank" rel="noopener noreferrer"&gt;your-email@example.com&lt;/a&gt; with the email address that should receive security alerts and Advanced Intrusion Detection Environment (AIDE) reports:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Deploy to development environment
aws cloudformation create-stack \
  --stack-name dev-hbase-log-analysis \
  --template-body file://cloudformation/hbase-log-analysis-simple.yaml \
  --parameters \
    ParameterKey=EnvironmentName,ParameterValue=dev \
    ParameterKey=EC2InstanceType,ParameterValue=m7g.xlarge \
    ParameterKey=SecurityAlertEmail,ParameterValue=your-email@example.com \
  --capabilities CAPABILITY_IAM \
  --region us-east-1
# Wait for deployment to complete (~15-20 minutes)
aws cloudformation wait stack-create-complete \
  --stack-name dev-hbase-log-analysis \
  --region us-east-1&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;Note the deployment outputs, including the Amazon OpenSearch Service endpoint and Amazon EC2 instance details, in the AWS CloudFormation console.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55492.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90183" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55492.png" alt="AWS CloudFormation stack outputs table displaying infrastructure resource identifiers including IAM roles, EC2 instances, security groups, S3 buckets, OpenSearch domain configuration, and VPC details for an HBase log analysis application in the development environment." width="943" height="497"&gt;&lt;/a&gt;&lt;/p&gt; 
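&lt;p&gt;If you prefer to capture these outputs in a script rather than reading them from the console, the following minimal Python sketch shows one way to flatten a &lt;code&gt;describe-stacks&lt;/code&gt; JSON response into a dictionary. It is an illustration and not part of the solution repository: the function name &lt;code&gt;stack_outputs&lt;/code&gt;, the &lt;code&gt;OpenSearchEndpoint&lt;/code&gt; key, and all sample values are assumptions. Only &lt;code&gt;EC2InstanceId&lt;/code&gt; is an output key that this walkthrough actually queries.&lt;/p&gt;

```python
import json
import sys


def stack_outputs(describe_stacks_json):
    """Map OutputKey to OutputValue for the first stack in a describe-stacks response."""
    response = json.loads(describe_stacks_json)
    stacks = response.get("Stacks", [])
    if not stacks:
        raise ValueError("response contains no stacks")
    return {
        item["OutputKey"]: item["OutputValue"]
        for item in stacks[0].get("Outputs", [])
    }


# Trimmed sample response. EC2InstanceId is the key queried later in this
# walkthrough; OpenSearchEndpoint and both values are hypothetical placeholders.
SAMPLE_RESPONSE = json.dumps({
    "Stacks": [{
        "StackName": "dev-hbase-log-analysis",
        "Outputs": [
            {"OutputKey": "EC2InstanceId",
             "OutputValue": "i-0123456789abcdef0"},
            {"OutputKey": "OpenSearchEndpoint",
             "OutputValue": "vpc-example.us-east-1.es.amazonaws.com"},
        ],
    }]
})

if __name__ == "__main__":
    # Pass "-" as the only argument to read real CLI output from stdin;
    # otherwise the built-in sample is used.
    use_stdin = len(sys.argv) == 2 and sys.argv[1] == "-"
    raw = sys.stdin.read() if use_stdin else SAMPLE_RESPONSE
    for key, value in sorted(stack_outputs(raw).items()):
        print(f"{key} = {value}")
```

&lt;p&gt;Assuming the sketch is saved as &lt;code&gt;stack_outputs.py&lt;/code&gt; (a hypothetical file name), you could feed it live data with &lt;code&gt;aws cloudformation describe-stacks --stack-name dev-hbase-log-analysis --region us-east-1 --output json | python3 stack_outputs.py -&lt;/code&gt;.&lt;/p&gt;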
&lt;p&gt;The deployment creates:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Amazon OpenSearch Service domain with vector search capabilities&lt;/li&gt; 
 &lt;li&gt;Amazon EC2 instance for data processing and AI analysis&lt;/li&gt; 
 &lt;li&gt;AWS IAM roles with appropriate permissions&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/vpc-security-groups.html" target="_blank" rel="noopener noreferrer"&gt;Security groups&lt;/a&gt; and Amazon VPC configuration&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;Step 2: Connect to Amazon EC2 instance and set up system&lt;/h3&gt; 
&lt;p&gt;Connect to the Amazon EC2 instance using AWS Systems Manager (SSM) and set up the required components.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;To connect and set up the system&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Run the following commands to get the instance ID from AWS CloudFormation outputs and connect via AWS Systems Manager (SSM):&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Get instance ID
INSTANCE_ID=$(aws cloudformation describe-stacks \
  --stack-name dev-hbase-log-analysis \
  --query 'Stacks[0].Outputs[?OutputKey==`EC2InstanceId`].OutputValue' \
  --output text \
  --region us-east-1)
# Connect via SSM
aws ssm start-session --target "$INSTANCE_ID" --region us-east-1&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55493.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90185" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55493.png" alt="Terminal screenshot showing AWS CLI commands to retrieve an EC2 instance ID from CloudFormation stack outputs and establish an AWS Systems Manager Session Manager connection to the instance in the us-east-1 region." width="1000" height="287"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;Clone the repository and run automated setup:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;# On EC2 instance
sudo su - ec2-user

# Re-install aws cli
sudo dnf remove awscli -y

# For ARM64 (Graviton instances - default)
curl "https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip" -o "awscliv2.zip"

# For x86_64 (if using non-Graviton instances)
# curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"

unzip awscliv2.zip
sudo ./aws/install

# update $PATH in ~/.bashrc
echo 'export PATH=$PATH:/usr/local/bin/' &amp;gt;&amp;gt; ~/.bashrc

# Reload ~/.bashrc
source ~/.bashrc

# Fork and clone the source code repository on GitHub: sample-emr-hbase-inconsistencies-detection-recovery-mcp-kiro
git clone https://github.com/YOUR_USERNAME/sample-emr-hbase-inconsistencies-detection-recovery-mcp-kiro.git hbase-analysis
cd hbase-analysis

# Run automated setup
chmod +x ./scripts/setup/automated-system-setup.sh
./scripts/setup/automated-system-setup.sh \
  --emr-version emr-7.12.0 \
  --stack-name dev-hbase-log-analysis \
  --region us-east-1&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The automated setup script installs:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;System dependencies (&lt;a href="https://aws.amazon.com/cli/" target="_blank" rel="noopener noreferrer"&gt;awscli&lt;/a&gt;, git, unzip)&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://github.com/astral-sh/uv" target="_blank" rel="noopener noreferrer"&gt;uv package manager&lt;/a&gt; and &lt;a href="https://github.com/opensearch-project/opensearch-mcp-server-py" target="_blank" rel="noopener noreferrer"&gt;OpenSearch MCP Server&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;Kiro CLI and &lt;a href="https://kiro.dev/docs/getting-started/authentication/" target="_blank" rel="noopener noreferrer"&gt;configuration&lt;/a&gt; with AWS IAM Identity Center authentication. The script also adds the Apache HBase open source repository and the Apache HBase operational tools repository to the Kiro CLI knowledge bases automatically&lt;/li&gt; 
 &lt;li&gt;HBase source repositories for your Amazon EMR version&lt;/li&gt; 
 &lt;li&gt;Python dependencies and MCP server configuration&lt;/li&gt; 
&lt;/ul&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;Add your own knowledge base to Kiro CLI&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;To extend Kiro CLI’s analysis beyond the Apache HBase open source repositories, you can add your organization’s HBase runbooks and troubleshooting guides as additional knowledge base repositories. Periodically validate and maintain your runbook contents so that they remain accurate and up to date, reflecting any changes in your HBase environment, configurations, or operational procedures. Use the following commands:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-php"&gt;# Navigate to the HBase repositories directory
cd /opt/hbase-repositories
# Clone your organization's HBase runbook repository
git clone &amp;lt;runbook-repository-url&amp;gt; &amp;lt;your-own-runbook-repo&amp;gt;
# Example:
# git clone https://github.com/your-org/hbase-runbooks.git hbase-runbooks
# git clone https://gitlab.company.com/ops/hbase-troubleshooting.git hbase-troubleshooting
# Add your custom repositories to Kiro CLI knowledge base manually (run these commands inside kiro-cli):
echo "/knowledge add --name \"Your custom HBase knowledge base\" --path /opt/hbase-repositories/&amp;lt;your-own-runbook-repo&amp;gt;" | kiro-cli
# Example:
# echo "/knowledge add --name \"Company HBase runbooks\" --path /opt/hbase-repositories/hbase-runbooks" | kiro-cli
# echo "/knowledge add --name \"HBase troubleshooting guides\" --path /opt/hbase-repositories/hbase-troubleshooting" | kiro-cli&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3&gt;Step 3: Configure Amazon EMR log analysis collection&lt;/h3&gt; 
&lt;p&gt;Set up data collection from your Amazon EMR clusters to gather HBase logs, metadata, and consistency reports using the recommended direct collection method.&lt;br&gt; &lt;em&gt;To configure Amazon EMR log analysis collection&lt;/em&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;On your Amazon EMR cluster primary node, run the following commands to download the collection scripts:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;# On EMR primary node
sudo su - hadoop

# Fork and clone the source code repository on GitHub: sample-emr-hbase-inconsistencies-detection-recovery-mcp-kiro
git clone https://github.com/YOUR_USERNAME/sample-emr-hbase-inconsistencies-detection-recovery-mcp-kiro.git hbase-analysis
cd hbase-analysis&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;Run the interactive collection wizard:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;# Run collection wizard
python3 scripts/utilities/emr_log_collection/emr_cluster_wizard_v2.py
&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Enter the requested parameters, such as the Amazon EMR cluster’s job flow ID, the log analysis Amazon S3 bucket name, and the lookback period in hours. The default lookback period is 4 hours.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55494.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90192" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55494.png" alt="Terminal screenshot of EMR Cluster Log Collection Wizard V2 showing an interactive command-line interface for configuring HBase diagnostic log collection from Amazon EMR clusters, with step indicators, input fields for job flow ID and S3 bucket, validation confirmations, and lookback hour configuration." width="1000" height="1017"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;The collection wizard performs these actions:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;ul&gt; 
 &lt;li&gt;Collects HBase logs from the local filesystem. See the prerequisites for the required access permissions.&lt;/li&gt; 
 &lt;li&gt;Runs &lt;code&gt;sudo -u hbase hbase hbck -details&lt;/code&gt; (or hbck2 for HBase 2.x)&lt;/li&gt; 
 &lt;li&gt;Runs &lt;code&gt;hdfs dfs -ls -R /hbase&lt;/code&gt; or &lt;code&gt;aws s3 ls &amp;lt;hbase-root-dir&amp;gt; --recursive&lt;/code&gt;&lt;/li&gt; 
 &lt;li&gt;Runs &lt;code&gt;hbase shell &amp;lt;&amp;lt;&amp;lt; 'scan "hbase:meta"'&lt;/code&gt;&lt;/li&gt; 
 &lt;li&gt;Creates properly named files matching analysis system requirements&lt;/li&gt; 
 &lt;li&gt;Uploads to Amazon S3 with correct naming conventions&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;Here’s the data collection summary:&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55495.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90193" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55495.png" alt="Terminal screenshot showing EMR Cluster Log Collection Wizard V2 completion summary with job flow ID, S3 bucket location, 4-hour lookback period, green success confirmation message, S3 file path, and detailed listing of seven collected diagnostic files including HBCK reports, HBase meta table scans, root directory paths, process information, log collection summary, node logs from all servers, and collection metadata in JSON format." width="1000" height="567"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;You can check the uploaded contents with the &lt;a href="https://aws.amazon.com/cli/" target="_blank" rel="noopener noreferrer"&gt;AWS CLI&lt;/a&gt;:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;aws s3 ls s3://&amp;lt;log-path&amp;gt; --recursive&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Here’s a screenshot of the outputs.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55496.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90194" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55496.png" alt="Terminal screenshot showing AWS CLI command output listing HBase diagnostic files and logs collected from an EMR cluster and stored in Amazon S3, displaying timestamps, file sizes, and complete S3 object paths including diagnostics directory with HBCK reports, meta table scans, root directory listings, process information, and logs directory with compressed application logs from HBase master and regionserver nodes." width="1000" height="112"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;ol start="4"&gt; 
 &lt;li&gt;On the analysis Amazon EC2 instance, download the collected files from Amazon S3:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;# On analysis EC2 instance
sudo su - ec2-user

# Download logs from S3
mkdir -p /tmp/hbase-log-analysis
cd /tmp/hbase-log-analysis
aws s3 sync s3://&amp;lt;S3-BUCKET-NAME&amp;gt;/emr-logs/&amp;lt;EMR-JOBFLOW-ID&amp;gt;/ .
&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;You can find your job flow ID (the cluster ID) in the Amazon EMR console:&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55497.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90196" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55497.png" alt="Amazon EMR clusters management dashboard displaying a table with clusters, showing one cluster entry named &amp;quot;test&amp;quot; in waiting status with green indicator, creation time, elapsed time, normalized instances, along with filter controls, search functionality, pagination showing page 1, and action buttons for View details, Terminate, Clone, and Create cluster operations." width="1000" height="109"&gt;&lt;/a&gt;&lt;/p&gt; 
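&lt;p&gt;Because the job flow ID becomes part of the Amazon S3 path used by the sync command in this step, a quick format check can catch typos before you download anything. The following is a minimal sketch under stated assumptions: the helper name &lt;code&gt;log_prefix&lt;/code&gt;, the bucket name, and the sample ID are hypothetical, and the pattern assumes that job flow IDs consist of &lt;code&gt;j-&lt;/code&gt; followed by uppercase letters and digits. The &lt;code&gt;emr-logs/&lt;/code&gt; prefix layout is taken from the sync command in this step.&lt;/p&gt;

```python
import re

# Assumption: job flow (cluster) IDs look like "j-" followed by uppercase
# letters and digits, for example j-2AXXXXXXGAPLF.
JOB_FLOW_ID_PATTERN = re.compile(r"j-[A-Z0-9]+")


def log_prefix(bucket, job_flow_id):
    """Build the S3 URI that the sync command in this step reads from."""
    if not JOB_FLOW_ID_PATTERN.fullmatch(job_flow_id):
        raise ValueError(f"unexpected job flow ID format: {job_flow_id!r}")
    return f"s3://{bucket}/emr-logs/{job_flow_id}/"


if __name__ == "__main__":
    # Hypothetical bucket name and job flow ID, for illustration only.
    print(log_prefix("my-log-analysis-bucket", "j-2AXXXXXXGAPLF"))
```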
&lt;p&gt;The generated files (&lt;code&gt;hbase-hbase-master-ip-xxx-xxx-xxx-xxx.ec2.internal.log.gz&lt;/code&gt;, &lt;code&gt;hbase-hbase-regionserver-ip-xxx-xxx-xxx-xxx.ec2.internal.log.gz&lt;/code&gt;, &lt;code&gt;hbck_report.txt&lt;/code&gt;, &lt;code&gt;hbase_rootdir_paths.txt&lt;/code&gt;, &lt;code&gt;hbase_meta.txt&lt;/code&gt;, &lt;code&gt;hbase_processes.txt&lt;/code&gt;, &lt;code&gt;log_copy_summary.txt&lt;/code&gt;) must match the layout that the automated processing script expects, as shown in the following directory listing.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55498.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90197" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55498.png" alt="Terminal screenshot showing recursive ls -lRt command output listing HBase diagnostic files and logs in /tmp/hbase-log-analysis/ directory, displaying file permissions, ownership by ec2-user, file sizes, timestamps, and complete directory structure including diagnostics directory with text files (manifest.json, HBCK report, meta table scan, process information, root directory paths, log copy summary), logs directory with nested nodes subdirectory containing redacted instance IDs, and applications/hbase subdirectories with compressed RegionServer and Master log files." width="1000" height="977"&gt;&lt;/a&gt;&lt;/p&gt; 
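&lt;p&gt;Before you move on to processing, you may want to confirm that the download contains every file named in this step. The following Python sketch is one possible check, not a script from the repository. The required file names come from the list in this step; the glob patterns for the HMaster and RegionServer logs, the function names, and the idea of matching on file name only (ignoring the directory a file sits in) are assumptions.&lt;/p&gt;

```python
from fnmatch import fnmatch
from pathlib import Path, PurePosixPath

# File names taken from the list in this step.
REQUIRED_FILES = (
    "hbck_report.txt",
    "hbase_rootdir_paths.txt",
    "hbase_meta.txt",
    "hbase_processes.txt",
    "log_copy_summary.txt",
)
# Assumed patterns: at least one HMaster log and one RegionServer log.
REQUIRED_LOG_PATTERNS = (
    "hbase-hbase-master-*.log.gz",
    "hbase-hbase-regionserver-*.log.gz",
)


def missing_items(relative_paths):
    """Return required file names or log patterns that no path satisfies."""
    names = {PurePosixPath(path).name for path in relative_paths}
    missing = [name for name in REQUIRED_FILES if name not in names]
    missing += [
        pattern for pattern in REQUIRED_LOG_PATTERNS
        if not any(fnmatch(name, pattern) for name in names)
    ]
    return missing


def scan(root):
    """List all files under root as POSIX-style relative paths."""
    base = Path(root)
    return [p.relative_to(base).as_posix() for p in base.rglob("*") if p.is_file()]


if __name__ == "__main__":
    problems = missing_items(scan("/tmp/hbase-log-analysis"))
    print("missing: " + ", ".join(problems) if problems else "all expected files present")
```

&lt;p&gt;Keeping &lt;code&gt;missing_items&lt;/code&gt; separate from the directory scan means that the same check could also run against the output of &lt;code&gt;aws s3 ls --recursive&lt;/code&gt; after you extract the object keys.&lt;/p&gt;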
&lt;h3&gt;Step 4: Process and index data&lt;/h3&gt; 
&lt;p&gt;Process the collected HBase data and create vector embeddings for intelligent search capabilities. To process and index the data, navigate to the project directory on the analysis Amazon EC2 instance and run &lt;code&gt;automated-log-processing.sh&lt;/code&gt;:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-shell"&gt;sudo su - ec2-user
cd ~/hbase-analysis
chmod +x ./scripts/processing/automated-log-processing.sh
./scripts/processing/automated-log-processing.sh \
  --job-flow-id j-YOUR-JOB-FLOW-ID \
  --stack-name dev-hbase-log-analysis&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The processing scripts extract and parse HBase logs and use sentence transformer models to generate vector embeddings from the log messages, which enables semantic search beyond keyword matching. The system uses the &lt;a href="https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2" target="_blank" rel="noopener noreferrer"&gt;all-MiniLM-L6-v2&lt;/a&gt; model by default (producing 384-dimensional embeddings) but supports configurable models with different embedding dimensions, and it automatically adapts the &lt;a href="https://docs.opensearch.org/latest/vector-search/creating-vector-index/" target="_blank" rel="noopener noreferrer"&gt;OpenSearch vector index&lt;/a&gt; to match the chosen model’s output.&lt;/p&gt; 
&lt;p&gt;The system processes HBase operational data from HMaster and RegionServer logs, including region operations, compaction activities, Write-Ahead Log events, memstore operations, and cluster management information. &lt;a href="https://docs.opensearch.org/latest/vector-search/" target="_blank" rel="noopener noreferrer"&gt;Vector embeddings&lt;/a&gt; capture error messages, exception stack traces, performance warnings, and multi-line log entries after text preprocessing. With this semantic representation, you can query conceptually for “region server performance issues” or “memory pressure” and receive contextually relevant results across different log files and time periods. Vector search also supports error correlation by grouping similar exceptions, performance analysis by identifying related bottlenecks, and operational pattern recognition.&lt;/p&gt; 
&lt;p&gt;Each log entry is stored in Amazon OpenSearch Service with its original metadata (timestamp, log level, source file, job flow ID) alongside the embedding vector, enabling both structured queries and AI-powered semantic analysis. This approach transforms raw HBase logs into a searchable knowledge base that supports anomaly detection, trend analysis, and predictive insights for proactive cluster management and troubleshooting.&lt;/p&gt; 
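&lt;p&gt;To illustrate how a vector index can follow the embedding model’s output size, the following Python sketch builds an OpenSearch index body whose &lt;code&gt;knn_vector&lt;/code&gt; field takes its dimension from the selected model. This is not the repository’s implementation. The field names, the helper names, and the &lt;code&gt;all-mpnet-base-v2&lt;/code&gt; entry are assumptions for illustration; only the default model and its 384 dimensions, and the metadata items listed above, come from this post.&lt;/p&gt;

```python
import json

# Model name to embedding dimension. all-MiniLM-L6-v2 (384) is the default
# named in this post; all-mpnet-base-v2 (768) is only an example alternative.
MODEL_DIMENSIONS = {
    "all-MiniLM-L6-v2": 384,
    "all-mpnet-base-v2": 768,
}


def build_index_body(model_name):
    """Return an OpenSearch index body whose vector field matches the model."""
    dimension = MODEL_DIMENSIONS[model_name]
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                # Hypothetical field names; the repository may use others.
                "timestamp": {"type": "date"},
                "log_level": {"type": "keyword"},
                "source_file": {"type": "keyword"},
                "job_flow_id": {"type": "keyword"},
                "message": {"type": "text"},
                "embedding": {"type": "knn_vector", "dimension": dimension},
            }
        },
    }


def check_document(document, index_body):
    """Raise ValueError if the vector length differs from the index dimension."""
    expected = index_body["mappings"]["properties"]["embedding"]["dimension"]
    actual = len(document["embedding"])
    if actual != expected:
        raise ValueError(f"embedding has {actual} values, index expects {expected}")


if __name__ == "__main__":
    print(json.dumps(build_index_body("all-MiniLM-L6-v2"), indent=2))
```

&lt;p&gt;Checking the vector length before indexing is a cheap guard: if the model is changed without recreating the index, the mismatch surfaces as a clear local error instead of a rejected indexing request.&lt;/p&gt;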
&lt;p&gt;All scripts use AWS IAM authentication automatically. Here’s a screenshot of the data processing outputs.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55499.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90198" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-55499.png" alt="Terminal screenshot showing successful completion of HBase log analysis processing, green checkmark, confirmation message &amp;quot;Successfully processed 4 file(s)&amp;quot;, and next steps section displaying three numbered instructions with redacted URLs for accessing OpenSearch Dashboards, starting Kiro CLI for AI-powered analysis, and querying data using job flow ID, followed by troubleshooting documentation references for HBase inconsistency analysis and log analysis guides." width="1000" height="243"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;h3&gt;Step 5: Enable AI-powered analysis&lt;/h3&gt; 
&lt;p&gt;Configure the AI analysis interface to enable natural language queries against your HBase operational data.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;To set up AI-powered analysis&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Launch Kiro CLI (already configured by automated setup):&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;code&gt;kiro-cli&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;Check the configured MCP servers and knowledge bases:&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;/mcp list&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554910.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90199" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554910.png" alt="Terminal screenshot showing MCP list command output displaying one configured MCP server named &amp;quot;opensearch-mcp-server&amp;quot; with command &amp;quot;uvx&amp;quot; in green and white text on dark background with pink shell prompt, featuring a purple &amp;quot;Configured MCP Servers&amp;quot; header with checkbox icon and green horizontal separator line." width="1000" height="137"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;/knowledge show&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554911.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90200" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554911.png" alt="Terminal screenshot showing &amp;quot;/knowledge show&amp;quot; command output displaying Agent kiro_default's knowledge base with repositories: Apache HBase source code, and HBase operational tools" width="1000" height="154"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;If these two knowledge bases are not listed, add them manually with the following commands:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-bash"&gt;# Note: Large repositories (~500MB) may take a while to index. Check progress with: /knowledge show
/knowledge add --name "HBase operational tools" --path /opt/hbase-repositories/hbase-operator-tools
/knowledge add --name "Apache HBase source code" --path /opt/hbase-repositories/hbase&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;Use natural language queries to analyze your HBase data. The AI analysis uses both the OpenSearch MCP Server for querying indexed data and the Filesystem knowledge bases for accessing HBase source code. You can add your custom runbooks for Kiro’s reference as well.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;strong&gt;For HBase inconsistency analysis:&lt;/strong&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-css"&gt;# HBase Inconsistency Detection and Remediation Guidelines
## Search Strategy
- Use fuzzy search for case variations/typos, term query for exact region IDs, match_phrase for paths, query_string for logs
- Always use .keyword subfields for exact text matching
- Cross-reference filesystem (wildcard: {"wildcard": {"path": "*&amp;lt;region_id&amp;gt;*"}}) with hbase:meta (match: {"match": {"row_key": "&amp;lt;region_id&amp;gt;"}})
- The total region count in hbase meta must match the total matched document count of wildcard path like "*/.regioninfo" in hbase rootdir path.  
- All terms of region_name.keyword for a region encoded name must match a wildcard path like "*/.regioninfo"
- All terms of table_name.keyword for a table must match a wildcard path like "*/.tabledesc*"
- 1595e783b53d99cd5eef43b6debb2682 is the master store region, which is located in &amp;lt;hbase-root-dir&amp;gt;/MasterData/data/master/store/1595e783b53d99cd5eef43b6debb2682/
- You may cross-check against the raw logs in /tmp/hbase-log-analysis/
## Issue Types
Orphan regions, missing .regioninfo, missing/extra regions in hbase:meta, rowkey holes, stuck RIT, master initialization failures
## Analysis Steps
### 1. Cross-Reference Meta vs Filesystem
- Filesystem regions NOT in hbase:meta → ORPHAN REGION
- Meta regions NOT in filesystem → MISSING REGION
### 2. Validate Region Chain Continuity
- Sort regions by STARTKEY, verify region[i].ENDKEY == region[i+1].STARTKEY
- First STARTKEY must be '', last ENDKEY must be ''
- Gaps → ROWKEY HOLE
### 3. Check Region States
- state != 'OPEN' → Check RIT
- Missing server assignment → UNASSIGNED
- Multiple servers → SPLIT BRAIN
- "deployed_servers" field must have only one region server address like "ip-xxx-xxx-xxx-xxx.ec2.internal,16020,1770781485397" . The value should not be null or have multiple values. 
### 4. Validate .regioninfo Files
- Missing .regioninfo in region directory → CORRUPT REGION
### 5. Cross-Check HBCK Report
- Compare orphan counts, RIT regions, filesystem vs meta region counts
### 6. Analyze Logs
- Search: "updating hbase:meta row=&amp;lt;region&amp;gt;", "STUCK", "RIT", "Failed" + "&amp;lt;region&amp;gt;", "Split"/"Merge" + "&amp;lt;region&amp;gt;"
## Remediation
- Reference knowledge bases: "Apache HBase source code", "HBase operational tools"
- Use hbck2: /usr/lib/hbase-operator-tools/hbase-hbck2.jar
- Prefix commands with sudo -u hbase
- Use aws s3 for S3-based rootdir
- Wait 300s after creating holes before hbck fixMeta (catalog janitor cycle)
- Use unassign instead of deprecated close_region
- If the region does not have .regioninfo in  &amp;lt;hbase-root-dir&amp;gt;/data/&amp;lt;namespace&amp;gt;/&amp;lt;table-name&amp;gt;/&amp;lt;region-encoded-name&amp;gt;/ but hbase:meta has that region's information and that region has been deployed on a healthy region server, you can use hbase shell to unassign and assign the region to re-generate .regioninfo
- Always run "hbase shell" and "hbase hbck" commands as the hbase user, for example "sudo -u hbase hbase shell" and "sudo -u hbase hbase hbck"
## Job flow
Target: &amp;lt;your-job-flow-id&amp;gt;
Inconsistency to detect: All kinds of inconsistencies&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
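&lt;p&gt;To make analysis steps 1 and 2 of the preceding prompt concrete, the following Python sketch applies the same rules to plain in-memory data. It is a simplified illustration, not what Kiro runs internally: the function names and sample regions are hypothetical, keys are compared as Python strings (HBase compares row keys as bytes), and a mismatch between one region’s ENDKEY and the next region’s STARTKEY is reported as a chain break without distinguishing a hole from an overlap.&lt;/p&gt;

```python
def cross_reference(meta_regions, filesystem_regions):
    """Step 1: compare encoded region names from hbase:meta and the filesystem."""
    meta = set(meta_regions)
    filesystem = set(filesystem_regions)
    return {
        "orphan": sorted(filesystem - meta),   # on the filesystem, not in meta
        "missing": sorted(meta - filesystem),  # in meta, not on the filesystem
    }


def chain_problems(regions):
    """Step 2: check one table's regions, given as (name, start_key, end_key)."""
    ordered = sorted(regions, key=lambda region: region[1])
    if not ordered:
        return ["table has no regions"]
    problems = []
    if ordered[0][1] != "":
        problems.append(f"first region {ordered[0][0]} does not start at ''")
    # Each region must end exactly where the next one starts.
    for current, following in zip(ordered, ordered[1:]):
        if current[2] != following[1]:
            problems.append(
                f"chain break between {current[0]} (ENDKEY {current[2]!r}) "
                f"and {following[0]} (STARTKEY {following[1]!r})"
            )
    if ordered[-1][2] != "":
        problems.append(f"last region {ordered[-1][0]} does not end at ''")
    return problems


if __name__ == "__main__":
    # Made-up region names and keys, for illustration only.
    print(cross_reference(["r1", "r2"], ["r2", "r3"]))
    print(chain_problems([("r1", "", "g"), ("r2", "m", "")]))
```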
&lt;p&gt;When Kiro asks for permission to run a tool, enter “y” to allow it or “t” to trust it, so that Kiro can search through the MCP server and knowledge bases.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554912.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90202" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554912.png" alt="Terminal screenshot showing MCP tool execution authorization prompt." width="1000" height="91"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;The output looks similar to the following. Kiro checks for HBase issues.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554913.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90203" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554913.png" alt="Terminal screenshot showing HBase database query results for user table entries with server configuration details and an HBase Inconsistency Detection Framework analysis report" width="1000" height="168"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;Kiro summarizes the examination results.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554914.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90204" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554914.png" alt="Terminal screenshot displaying HBase inconsistency detection analysis results for job flow, showing one critical missing .regioninfo file issue for HBase region in a HBase table, with cluster health metrics, risk assessment, recommended fixes, and generated diagnostic reports." width="1000" height="628"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;After summarizing the issue, Kiro provides mitigation commands.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554915.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90206" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554915.png" alt="Terminal screenshot displaying a structured HBase quick fix guide with three sections: recommended fix procedure with sequential steps for region reassignment, verification steps using AWS S3 and HBCK2 tools, and impact assessment showing 30-60 second downtime, zero data loss risk, and isolated region scope for fixing missing .regioninfo file in HBase region." width="1000" height="626"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;h2&gt;Cleaning up&lt;/h2&gt; 
&lt;p&gt;To avoid incurring future charges, delete the resources created during this walkthrough.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;&lt;em&gt;To clean up the resources&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Delete the AWS CloudFormation stack from the &lt;a href="https://aws.amazon.com/console/" target="_blank" rel="noopener noreferrer"&gt;AWS Management Console&lt;/a&gt;:&lt;/strong&gt;&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554916.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90208" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554916.png" alt="AWS CloudFormation Stacks management console displaying a list view with stacks, showing the &amp;quot;dev-hbase-log-analysis&amp;quot; stack with CREATE_COMPLETE status, along with action buttons for Delete, Update stack, Stack actions, and Create stack." width="1000" height="139"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;&lt;strong&gt;Clean up Amazon EMR cluster resources (if created only for this walkthrough):&lt;/strong&gt;&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554917.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90209" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554917.png" alt="AWS EMR Clusters management console showing page clusters with a cluster in &amp;quot;Waiting&amp;quot; status" width="1000" height="138"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;&lt;strong&gt;Verify resource cleanup in the &lt;/strong&gt;&lt;a href="https://console.aws.amazon.com/" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;AWS Console&lt;/strong&gt;&lt;/a&gt;. Confirm that all resources are deleted, and review your AWS bill for unexpected charges.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;strong&gt;Important considerations:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Amazon OpenSearch Service domains take several minutes to fully delete&lt;/li&gt; 
 &lt;li&gt;Amazon S3 buckets with versioning retain object versions&lt;/li&gt; 
 &lt;li&gt;Use smaller instance types for development to optimize costs&lt;/li&gt; 
 &lt;li&gt;Monitor usage with &lt;a href="https://aws.amazon.com/aws-cost-management/aws-cost-explorer/" target="_blank" rel="noopener noreferrer"&gt;AWS Cost Explorer&lt;/a&gt;&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, we showed you how to build an AI-powered HBase troubleshooting solution that transforms manual log analysis into an automated workflow. By combining Amazon OpenSearch Service vector search with Amazon Bedrock-powered analysis through the Kiro CLI, operations teams can resolve complex HBase inconsistencies faster and gain deeper operational insights. The solution demonstrates how AI augments human expertise to improve operational efficiency, reducing HBase inconsistency resolution from hours to minutes and root cause identification from days to hours. Ready to transform your HBase operations? Get started with the GitHub repository and explore the Amazon OpenSearch Service documentation for additional guidance on vector search capabilities.&lt;/p&gt; 
&lt;h3&gt;Acknowledgments&lt;/h3&gt; 
&lt;p&gt;The author would like to thank Xi Yang, Anirudh Chawla, and Sasidhar Puthambakkam for their contributions to developing the technical solution. Xi Yang is a Senior Hadoop System Engineer and Amazon EMR subject matter expert at AWS. Anirudh Chawla is an AWS Analytics Specialist Solution Architect who helps organizations harness their data effectively through AWS’s analytics platform. Sasidhar Puthambakkam is a Senior Hadoop Systems Engineer and Amazon EMR subject matter expert who provides architectural guidance for complex big data workloads.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554918-1.png" target="_blank" rel="noopener"&gt;&lt;img loading="lazy" class="alignleft wp-image-90218 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/14/BDB-554918-1.png" alt="Yu-Ting Su" width="238" height="318"&gt;&lt;/a&gt;
  &lt;/div&gt; 
  &lt;p&gt;Yu-Ting Su, Sr. Hadoop System Engineer, AWS Support Engineering. Yu-Ting is a Sr. Hadoop Systems Engineer at Amazon Web Services (AWS). Her expertise is in Amazon EMR and Amazon OpenSearch Service. She’s passionate about distributed computing and helping people bring their ideas to life.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>How to use streamlined permissions for Amazon S3 Tables and Iceberg materialized views</title>
		<link>https://aws.amazon.com/blogs/big-data/how-to-use-streamlined-permissions-for-amazon-s3-tables-and-iceberg-materialized-views/</link>
					
		
		<dc:creator><![CDATA[Srividya Parthasarathy]]></dc:creator>
		<pubDate>Mon, 11 May 2026 18:59:36 +0000</pubDate>
				<category><![CDATA[Amazon Athena]]></category>
		<category><![CDATA[Amazon EMR]]></category>
		<category><![CDATA[Amazon Redshift]]></category>
		<category><![CDATA[Amazon S3 Tables]]></category>
		<category><![CDATA[Amazon Simple Storage Service (S3)]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[AWS Big Data]]></category>
		<category><![CDATA[AWS Glue]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<guid isPermaLink="false">6aebad2189939edb9af142d1c6da760ad5fa5a2b</guid>

					<description>In this post, we walk through how to set up and manage S3 Tables in the AWS Glue Data Catalog, create and query Iceberg materialized views, and configure access controls that work across your analytics stack with IAM-based authorization.</description>
										<content:encoded>&lt;p&gt;Apache Iceberg has emerged as the open table format for data lakes. It handles petabyte-scale datasets, lets teams evolve schemas and partitions in place, and supports time travel and incremental processing for data lake management at scale. &lt;a href="https://aws.amazon.com/s3/features/tables/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3 Tables&lt;/a&gt; provide a fully managed Apache Iceberg table experience in &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3&lt;/a&gt;, optimized for analytics workloads, and integrate with the &lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/serverless-etl-aws-glue/aws-glue-data-catalog.html" target="_blank" rel="noopener noreferrer"&gt;AWS Glue Data Catalog&lt;/a&gt; so AWS analytics services such as &lt;a href="https://aws.amazon.com/redshift" target="_blank" rel="noopener noreferrer"&gt;Amazon Redshift&lt;/a&gt;,&amp;nbsp;&lt;a href="https://aws.amazon.com/emr/" target="_blank" rel="noopener noreferrer"&gt;Amazon EMR&lt;/a&gt;,&amp;nbsp;&lt;a href="https://aws.amazon.com/athena" target="_blank" rel="noopener noreferrer"&gt;Amazon Athena&lt;/a&gt;,&amp;nbsp;&lt;a href="https://aws.amazon.com/sagemaker/" target="_blank" rel="noopener noreferrer"&gt;Amazon SageMaker&lt;/a&gt;, and&amp;nbsp;&lt;a href="https://aws.amazon.com/glue" target="_blank" rel="noopener noreferrer"&gt;AWS Glue&lt;/a&gt; query your data. Together, they form the foundation of a modern data lake architecture on AWS.&lt;/p&gt; 
&lt;p&gt;S3 Tables integrate with the AWS Glue Data Catalog using &lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener noreferrer"&gt;AWS Identity and Access Management (IAM)&lt;/a&gt;-based authorization. If you manage analytics workloads across these services, you can now define permissions across storage, catalog, and compute in a single IAM policy. This gives teams already using IAM a straightforward path to govern access to S3 Tables resources without changing their existing permission model. For fine-grained access controls, you can opt in to AWS Lake Formation at any time through the AWS Management Console, &lt;a href="https://aws.amazon.com/cli" target="_blank" rel="noopener noreferrer"&gt;AWS Command Line Interface (AWS CLI)&lt;/a&gt;, API, or &lt;a href="https://aws.amazon.com/blogs/storage/tag/aws-cloudformation/" target="_blank" rel="noopener noreferrer"&gt;AWS CloudFormation&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;Iceberg materialized views created in the Glue Data Catalog extend this foundation by letting you store pre-computed query results as Iceberg data on Amazon S3. When a query repeats aggregations or joins across large datasets, the engine reads directly from the materialized view’s S3 location rather than reprocessing the base tables. A materialized view can reside in S3 Tables or in an S3 general purpose bucket, independent of where its base tables live, which lets you place pre-computed results wherever fits your access patterns and cost model best.&lt;/p&gt; 
&lt;p&gt;In this post, we walk through how to set up and manage S3 Tables in the AWS Glue Data Catalog, create and query Iceberg materialized views, and configure access controls that work across your analytics stack with IAM-based authorization.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58901.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90882" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58901.png" alt="Architecture diagram showing AWS Glue Data Catalog integration with Amazon Athena, AWS Glue, Amazon Redshift, and Amazon EMR through IAM roles and policies, with Amazon S3 storage and optional AWS Lake Formation governance." width="1404" height="1143"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;The above architecture illustrates how S3 Tables integrate with AWS Glue Data Catalog using IAM-based authorization, so you can define the necessary permissions across storage, catalog, and query engines in a single IAM policy. This permission model accelerates onboarding for new teams and workloads.&lt;/p&gt; 
&lt;h3&gt;Key architecture components include:&lt;/h3&gt; 
&lt;p&gt;&lt;strong&gt;Storage Layer&lt;/strong&gt;: Data is stored as Iceberg tables in Amazon S3 Tables.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Catalog Layer&lt;/strong&gt;: AWS Glue Data Catalog serves as the single metadata repository.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Compute Layer&lt;/strong&gt;: Amazon Athena, AWS Glue, Amazon Redshift, and Amazon EMR connect to a single Data Catalog to access Iceberg tables.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Security&lt;/strong&gt;: AWS IAM authorizes access to resources in the storage, catalog, and compute layers.&lt;/p&gt; 
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;To follow along with this post, you must have an &lt;a href="https://aws.amazon.com/resources/create-account/" target="_blank" rel="noopener noreferrer"&gt;AWS account&lt;/a&gt; and an IAM role or user with appropriate permissions, and you should be familiar with the following services:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;IAM&lt;/li&gt; 
 &lt;li&gt;AWS Glue Data Catalog&lt;/li&gt; 
 &lt;li&gt;Amazon S3&lt;/li&gt; 
 &lt;li&gt;Amazon Athena&lt;/li&gt; 
 &lt;li&gt;Amazon Redshift&lt;/li&gt; 
 &lt;li&gt;Amazon EMR&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;For the minimum permissions required for the role/user for metadata and data access, refer to &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/s3tables-catalog-prerequisites.html#s3tables-required-iam-permissions" target="_blank" rel="noopener noreferrer"&gt;required IAM permissions documentation&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;Solution walkthrough&lt;/h2&gt; 
&lt;p&gt;In this walkthrough, you will integrate S3 Tables with the AWS Glue Data Catalog, create Iceberg materialized views, and query data using multiple analytics engines. You will also learn how to use materialized views for complex aggregations that are queried frequently while the underlying data continues to change. The walkthrough takes about 45–60 minutes to complete.&lt;/p&gt; 
&lt;h3&gt;Set up S3 Tables and integrate with Glue Data Catalog&lt;/h3&gt; 
&lt;p&gt;Navigate to the Amazon S3 console:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;On the left menu, select &lt;strong&gt;Table buckets.&lt;/strong&gt;&lt;/li&gt; 
 &lt;li&gt;Choose the &lt;strong&gt;Create table bucket&lt;/strong&gt; button.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58902.jpg"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90883" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58902.jpg" alt="Amazon S3 console showing the Table buckets management page in the US West (N. California) us-west-1 Region with zero table buckets, integration status disabled, and the Create table bucket button highlighted." width="989" height="565"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;On the next screen, enter &lt;strong&gt;salesbucket&lt;/strong&gt; as the name of the bucket. Make sure that &lt;strong&gt;Enable Integration&lt;/strong&gt; is selected. This step integrates S3 Tables with AWS Glue Data Catalog.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58903.jpg"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90884" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58903.jpg" alt="AWS S3 Create table bucket form with General configuration showing bucket name &amp;quot;salesbucket&amp;quot; and Integration with AWS analytics services section with Enable integration checkbox selected." width="989" height="520"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;ol start="4"&gt; 
 &lt;li&gt;Keep the other options as default and choose &lt;strong&gt;Create table bucket&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;After it is created, you will be redirected back to the list of table buckets. Choose the table bucket &lt;strong&gt;salesbucket&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Select the &lt;strong&gt;Create table with Athena&lt;/strong&gt; button.&lt;/li&gt; 
 &lt;li&gt;Create a namespace in S3 Tables, which is equivalent to a database in AWS Glue Data Catalog. Enter “sales” as the namespace (database) name and choose &lt;strong&gt;Create namespace&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58904.jpg"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90885" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58904.jpg" alt="Create table with Athena dialog in the Amazon S3 salesbucket console showing namespace configuration with &amp;quot;Create a namespace&amp;quot; selected and namespace name set to &amp;quot;sales.&amp;quot;" width="1375" height="747"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;ol start="8"&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Create table with Athena&lt;/strong&gt;. A new tab opens with the Amazon Athena console.&lt;/li&gt; 
 &lt;li&gt;When the Amazon Athena console opens, you will see an example query that creates a table, followed by examples that insert rows into that table. To use this query block, uncomment the code, then highlight and run each statement individually. At the end, you will have data in the table.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58905.jpg"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90887" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58905.jpg" alt="Amazon Athena query editor showing a SQL analytics query on the daily_sales table with results displaying product categories, units sold, total revenue, and average price for February 2024 sales data." width="1357" height="741"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;h3&gt;Query S3 Tables and create a materialized view using Amazon EMR&lt;/h3&gt; 
&lt;p&gt;To run the instructions on Amazon EMR, complete the following steps to configure the cluster:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Create an IAM role for the Amazon EMR instance profile following the &lt;a href="https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-iam-role.html" target="_blank" rel="noopener noreferrer"&gt;Amazon EMR Management Guide&lt;/a&gt;. Add the following permissions policy and trust relationship so that the role can work with materialized views.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Replace &amp;lt;ACCOUNT ID&amp;gt; with your AWS account ID, &amp;lt;Instance_profile_role&amp;gt; with the Amazon EMR instance profile role, and &amp;lt;REGION&amp;gt; with your AWS Region.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-css"&gt;{
&amp;nbsp;&amp;nbsp; "Version":"2012-10-17",
&amp;nbsp;&amp;nbsp; "Statement":[
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;{
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "Sid":"GlueDataCatalogPermissions",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "Effect":"Allow",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "Action":[
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"glue:GetCatalog",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"glue:GetDatabase",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"glue:CreateTable",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"glue:GetTable",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"glue:GetTables",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"glue:UpdateTable",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"glue:DeleteTable"
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ],
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "Resource":[
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT ID&amp;gt;:catalog",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT ID&amp;gt;:catalog/s3tablescatalog",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT ID&amp;gt;:catalog/s3tablescatalog/*",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT ID&amp;gt;:database/salesdb",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT ID&amp;gt;:database/salesdb/*",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT ID&amp;gt;:database/s3tablescatalog",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT ID&amp;gt;:database/s3tablescatalog/*",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT ID&amp;gt;:table/s3tablescatalog/*",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT ID&amp;gt;:table/*/*"
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ]
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;},
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;{
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "Sid":"S3TablesDataAccessPermissions",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "Effect":"Allow",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "Action":[
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"s3tables:GetTableBucket",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"s3tables:GetNamespace",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"s3tables:GetTable",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"s3tables:GetTableMetadataLocation",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"s3tables:GetTableData",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"s3tables:ListTableBuckets",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"s3tables:CreateTable",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"s3tables:PutTableData",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"s3tables:UpdateTableMetadataLocation",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"s3tables:ListNamespaces",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"s3tables:ListTables",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"s3tables:DeleteTable"
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ],
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "Resource":[
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"arn:aws:s3tables:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNT ID&amp;gt;:bucket/*"
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; ]
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;},
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;{
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "Effect":"Allow",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "Action":"iam:PassRole",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; "Resource":"arn:aws:iam::&amp;lt;ACCOUNT ID&amp;gt;:role/service-role/&amp;lt;Instance_profile_role&amp;gt;"
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp;}
&amp;nbsp;&amp;nbsp; ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
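&lt;p&gt;Filling in the Region and account ID by hand across nine Glue resource ARNs is easy to get wrong. The following is a minimal Python sketch, not part of the original solution, that generates the Glue ARNs from the policy above. The function name glue_resource_arns and the sample Region and account ID are hypothetical.&lt;/p&gt; 

```python
import json

# Resource suffixes copied from the Glue statement of the policy above.
GLUE_RESOURCE_SUFFIXES = [
    "catalog",
    "catalog/s3tablescatalog",
    "catalog/s3tablescatalog/*",
    "database/salesdb",
    "database/salesdb/*",
    "database/s3tablescatalog",
    "database/s3tablescatalog/*",
    "table/s3tablescatalog/*",
    "table/*/*",
]


def glue_resource_arns(region, account_id):
    """Return the Glue ARNs from the policy with Region and account ID filled in."""
    # AWS account IDs are 12-digit numbers; reject anything else early.
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("AWS account IDs are 12 digits")
    prefix = "arn:aws:glue:{}:{}:".format(region, account_id)
    return [prefix + suffix for suffix in GLUE_RESOURCE_SUFFIXES]


if __name__ == "__main__":
    # Hypothetical values; substitute your own Region and account ID.
    print(json.dumps(glue_resource_arns("us-east-1", "111122223333"), indent=2))
```

&lt;p&gt;You can paste the printed list into the Resource element of the GlueDataCatalogPermissions statement.&lt;/p&gt; 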
&lt;p&gt;Add the following to the trust policy in addition to existing:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-css"&gt;&amp;nbsp;{
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"Sid": "",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"Effect": "Allow",
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"Principal": {
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"Service": "glue.amazonaws.com"
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;},
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;"Action": "sts:AssumeRole"
&amp;nbsp;&amp;nbsp; &amp;nbsp; &amp;nbsp; &amp;nbsp;}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
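&lt;p&gt;Because the Glue statement is added to whatever the trust policy already contains, it helps to merge it instead of overwriting the document. The following Python sketch shows one way to do that. The helper name add_glue_trust and the sample starting policy (an Amazon EC2 trust statement, assumed here as a typical instance profile trust policy) are illustrative assumptions.&lt;/p&gt; 

```python
import copy
import json

# Statement from this post that lets AWS Glue assume the role.
GLUE_TRUST_STATEMENT = {
    "Sid": "",
    "Effect": "Allow",
    "Principal": {"Service": "glue.amazonaws.com"},
    "Action": "sts:AssumeRole",
}


def add_glue_trust(trust_policy):
    """Return a copy of trust_policy with the Glue statement appended once."""
    merged = copy.deepcopy(trust_policy)
    statements = merged.setdefault("Statement", [])
    # Skip the append if an identical statement is already present.
    if GLUE_TRUST_STATEMENT not in statements:
        statements.append(copy.deepcopy(GLUE_TRUST_STATEMENT))
    return merged


if __name__ == "__main__":
    # Assumed starting point: an EC2 trust statement on the instance profile role.
    existing = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "ec2.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }
    print(json.dumps(add_glue_trust(existing), indent=2))
```

&lt;p&gt;The helper leaves the input untouched and is safe to run more than once, because it skips the append when the statement already exists.&lt;/p&gt; 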
&lt;ol start="2"&gt; 
 &lt;li&gt;Launch an Amazon EMR cluster on release 7.12.0 or higher, with Iceberg enabled and with the instance profile role created in the previous step. For more information, refer to &lt;a href="https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-iceberg-use-spark-cluster.html" target="_blank" rel="noopener noreferrer"&gt;Use an Iceberg cluster with Spark&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;Connect to the primary node of your Amazon EMR cluster by using SSH, and run the following command to start a Spark application with the required configurations:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Replace &amp;lt;bucket_name&amp;gt;, &amp;lt;region&amp;gt;, and &amp;lt;accountid&amp;gt; with your bucket name, AWS Region, and AWS account ID.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-css"&gt;spark-sql \
&amp;nbsp;&amp;nbsp;--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
&amp;nbsp;&amp;nbsp;--conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog \
&amp;nbsp;&amp;nbsp;--conf spark.sql.catalog.glue_catalog.type=glue \
&amp;nbsp;&amp;nbsp;--conf spark.sql.catalog.glue_catalog.warehouse=s3://&amp;lt;bucket_name&amp;gt; \
&amp;nbsp;&amp;nbsp;--conf spark.sql.catalog.glue_catalog.glue.region=&amp;lt;region&amp;gt; \
&amp;nbsp;&amp;nbsp;--conf spark.sql.catalog.glue_catalog.glue.id=&amp;lt;accountid&amp;gt;:s3tablescatalog/salesbucket \
&amp;nbsp;&amp;nbsp;--conf spark.sql.catalog.glue_catalog.glue.account-id=&amp;lt;accountid&amp;gt; \
&amp;nbsp;&amp;nbsp;--conf spark.sql.catalog.glue_catalog.client.region=&amp;lt;region&amp;gt; \
&amp;nbsp;&amp;nbsp;--conf spark.sql.optimizer.answerQueriesWithMVs.enabled=true \
&amp;nbsp;&amp;nbsp;--conf spark.sql.defaultCatalog=glue_catalog&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
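&lt;p&gt;The Region and account ID each appear more than once in the command above, so generating the command from a single set of values can help keep them consistent. The following Python sketch assembles the same spark-sql invocation. The function name spark_sql_command, its parameter names, and the sample values are hypothetical; the configuration keys are copied from the command above.&lt;/p&gt; 

```python
import shlex


def spark_sql_command(region, account_id, table_bucket, warehouse_bucket):
    """Assemble the spark-sql invocation shown above as a list of arguments."""
    catalog = "spark.sql.catalog.glue_catalog"
    conf = {
        "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
        catalog: "org.apache.iceberg.spark.SparkCatalog",
        catalog + ".type": "glue",
        catalog + ".warehouse": "s3://" + warehouse_bucket,
        catalog + ".glue.region": region,
        catalog + ".glue.id": "{}:s3tablescatalog/{}".format(account_id, table_bucket),
        catalog + ".glue.account-id": account_id,
        catalog + ".client.region": region,
        "spark.sql.optimizer.answerQueriesWithMVs.enabled": "true",
        "spark.sql.defaultCatalog": "glue_catalog",
    }
    args = ["spark-sql"]
    # Dictionaries keep insertion order, so the flags match the order above.
    for key, value in conf.items():
        args += ["--conf", "{}={}".format(key, value)]
    return args


if __name__ == "__main__":
    # Hypothetical values; replace them with your own.
    command = spark_sql_command(
        "us-east-1", "111122223333", "salesbucket", "my-warehouse-bucket"
    )
    print(shlex.join(command))
```

&lt;p&gt;Returning a list instead of a single string means the arguments can be passed directly to subprocess.run without shell quoting concerns.&lt;/p&gt; 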
&lt;ol start="4"&gt; 
 &lt;li&gt;Run the following statements to query the daily_sales table.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;spark-sql ()&amp;gt; use sales;
spark-sql (sales)&amp;gt; select * from daily_sales;
2024-01-15 Laptop 900.0
2024-01-15 Monitor 250.0
2024-01-16 Laptop 1350.0
2024-02-01 Monitor 300.0
2024-02-01 Keyboard 60.0
2024-02-02 Mouse 25.0
2024-02-02 Laptop 1050.0
2024-02-03 Laptop 1200.0
2024-02-03 Monitor 375.0&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="5"&gt; 
 &lt;li&gt;Create a materialized view.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;CREATE MATERIALIZED VIEW sales_mv as 
SELECT 
&amp;nbsp; &amp;nbsp; product_category,
&amp;nbsp; &amp;nbsp;&amp;nbsp;COUNT(*) as units_sold,
&amp;nbsp; &amp;nbsp; SUM(sales_amount) as total_revenue, 
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;AVG(sales_amount) as average_price 
FROM 
&amp;nbsp; &amp;nbsp; glue_catalog.sales.daily_sales 
GROUP BY 
&amp;nbsp; &amp;nbsp; product_category;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;A newly created materialized view is populated with the initial query results but does not update automatically as base table data changes. To keep it current, specify a SCHEDULE REFRESH EVERY clause when creating the view. The clause accepts a time interval and unit, so you can define how often the materialized view is recomputed from the base tables.&lt;/p&gt; 
&lt;ol start="6"&gt; 
 &lt;li&gt;Add a refresh interval.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;CREATE MATERIALIZED VIEW sales_mv 
SCHEDULE REFRESH EVERY 2 HOURS as 
SELECT 
&amp;nbsp; &amp;nbsp; product_category,
&amp;nbsp; &amp;nbsp;&amp;nbsp;COUNT(*) as units_sold,
&amp;nbsp; &amp;nbsp; SUM(sales_amount) as total_revenue, 
&amp;nbsp;&amp;nbsp;&amp;nbsp;&amp;nbsp;AVG(sales_amount) as average_price 
FROM 
&amp;nbsp; &amp;nbsp; glue_catalog.sales.daily_sales 
GROUP BY 
&amp;nbsp; &amp;nbsp; product_category;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;ol start="7"&gt; 
 &lt;li&gt;Alternatively, you can refresh the materialized view manually.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;For manual full refresh, you can use the following command:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;REFRESH MATERIALIZED VIEW&amp;nbsp;sales_mv FULL;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;For manual incremental refresh, you can use the following command:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;REFRESH MATERIALIZED VIEW&amp;nbsp;sales_mv;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;For more details, refer to &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/materialized-views.html#materialized-views-refreshing" target="_blank" rel="noopener noreferrer"&gt;Refreshing materialized views&lt;/a&gt;.&lt;/p&gt; 
&lt;ol start="8"&gt; 
 &lt;li&gt;Query the materialized view.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;spark-sql (sales)&amp;gt; select * from sales_mv;
Keyboard 1 60.0 60.0
Laptop 4 4500.0 1125.0
Mouse 1 25.0 25.0
Monitor 3 925.0 308.3333333333333&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
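&lt;p&gt;To sanity-check the materialized view output, you can reproduce the aggregation locally from the nine daily_sales rows returned earlier. The following Python sketch mirrors the view's GROUP BY logic. It is only a local cross-check and does not connect to AWS; the function name aggregate_by_category is hypothetical, and treating the first column as the sale date is an assumption based on the query output.&lt;/p&gt; 

```python
from collections import defaultdict

# Rows returned by the daily_sales query earlier in this walkthrough.
DAILY_SALES = [
    ("2024-01-15", "Laptop", 900.0),
    ("2024-01-15", "Monitor", 250.0),
    ("2024-01-16", "Laptop", 1350.0),
    ("2024-02-01", "Monitor", 300.0),
    ("2024-02-01", "Keyboard", 60.0),
    ("2024-02-02", "Mouse", 25.0),
    ("2024-02-02", "Laptop", 1050.0),
    ("2024-02-03", "Laptop", 1200.0),
    ("2024-02-03", "Monitor", 375.0),
]


def aggregate_by_category(rows):
    """Mirror the view's GROUP BY: row count, sum, and average per category."""
    amounts = defaultdict(list)
    for _sale_date, category, amount in rows:
        amounts[category].append(amount)
    return {
        category: (len(values), sum(values), sum(values) / len(values))
        for category, values in amounts.items()
    }


if __name__ == "__main__":
    # Print one line per category: name, units_sold, total_revenue, average_price.
    for category, stats in sorted(aggregate_by_category(DAILY_SALES).items()):
        print(category, *stats)
```

&lt;p&gt;The per-category counts, totals, and averages should match the rows returned by the materialized view, although the row order may differ.&lt;/p&gt; 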
&lt;p&gt;After the Iceberg materialized views are created, you can access them using IAM principals that have the required IAM permissions on the Glue Data Catalog resources and their underlying storage.&lt;/p&gt; 
&lt;p&gt;Iceberg materialized views are flexible in how they combine base tables and access control modes. Base tables can reside in S3 general-purpose buckets (with IAM or Lake Formation access control), in S3 Tables (through the s3tablescatalog catalog), or a combination of these, all within a single materialized view definition. The materialized view itself can use either IAM or AWS Lake Formation access control, independently of its base tables.&lt;/p&gt; 
&lt;p&gt;For more details, refer to &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/materialized-views.html#materialized-views-how-they-work" target="_blank" rel="noopener noreferrer"&gt;How materialized views work with AWS Glue&lt;/a&gt;.&lt;/p&gt; 
&lt;h3&gt;Query using Athena&lt;/h3&gt; 
&lt;p&gt;Additionally, you can query the same materialized view from Athena SQL. The following image shows the same query run on Athena and the resulting output.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58906.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90888" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58906.png" alt="Amazon Athena query editor showing SELECT query results from the sales_mv materialized view with product category aggregations including Keyboard and Laptop sales data." width="1429" height="610"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;h3&gt;Query using Amazon Redshift&lt;/h3&gt; 
&lt;p&gt;To query the S3 Tables in AWS Glue Data Catalog using Amazon Redshift, you must create a database in the default catalog in Glue Data Catalog that points to the S3 Tables catalog.&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;On the AWS Glue console, choose &lt;strong&gt;Databases&lt;/strong&gt;, and then choose &lt;strong&gt;Add Database&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58907.jpg"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90889" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58907.jpg" alt="AWS Glue Data Catalog Databases page showing one default database in catalog 466053964652, with the Add database button highlighted." width="995" height="556"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;Choose the &lt;strong&gt;Glue Database resource link&lt;/strong&gt; option, enter a name for the database (this post uses salesdb), and choose &lt;strong&gt;salesbucket&lt;/strong&gt; as the target catalog and &lt;strong&gt;sales&lt;/strong&gt; as the target database. Then choose &lt;strong&gt;Create database&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58908.jpg"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90890" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58908.jpg" alt="AWS Glue Create a database form with Glue Database Resource Link selected, name set to &amp;quot;salesdb,&amp;quot; target catalog &amp;quot;salesbucket,&amp;quot; and target database &amp;quot;sales.&amp;quot;" width="992" height="553"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;After you create the database, the “salesdb” resource link appears under &lt;strong&gt;Databases&lt;/strong&gt; in the AWS Glue Data Catalog.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58909.jpg"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90891" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-58909.jpg" alt="AWS Glue Data Catalog Databases page showing two databases: &amp;quot;default&amp;quot; and the newly created &amp;quot;salesdb&amp;quot; resource link with source catalog pointing to s3tablescatalog." width="1368" height="361"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;Create an IAM role with the following policy for Amazon Redshift schema creation. Replace the AWS Region and account ID placeholders with the values for your account.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Sid":"GlueDataCatalogPermissions",
         "Effect":"Allow",
         "Action":[
            "glue:GetCatalog",
            "glue:GetDatabase",
            "glue:CreateTable",
            "glue:GetTable",
            "glue:GetTables",
            "glue:UpdateTable",
            "glue:DeleteTable"
         ],
         "Resource":[
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNTID&amp;gt;:catalog",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNTID&amp;gt;:catalog/s3tablescatalog",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNTID&amp;gt;:catalog/s3tablescatalog/*",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNTID&amp;gt;:database/salesdb",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNTID&amp;gt;:database/salesdb/*",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNTID&amp;gt;:database/s3tablescatalog",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNTID&amp;gt;:database/s3tablescatalog/*",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNTID&amp;gt;:table/s3tablescatalog/*",
            "arn:aws:glue:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNTID&amp;gt;:table/*/*"
         ]
      },
      {
         "Sid":"S3TablesDataAccessPermissions",
         "Effect":"Allow",
         "Action":[
            "s3tables:GetTableBucket",
            "s3tables:GetNamespace",
            "s3tables:GetTable",
            "s3tables:GetTableMetadataLocation",
            "s3tables:GetTableData",
            "s3tables:ListTableBuckets",
            "s3tables:CreateTable",
            "s3tables:PutTableData",
            "s3tables:UpdateTableMetadataLocation",
            "s3tables:ListNamespaces",
            "s3tables:ListTables",
            "s3tables:DeleteTable"
         ],
         "Resource":[
            "arn:aws:s3tables:&amp;lt;REGION&amp;gt;:&amp;lt;ACCOUNTID&amp;gt;:bucket/*"
         ]
      }
   ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
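&lt;p&gt;If you prefer to generate the resource list rather than edit each placeholder by hand, the following Python sketch assembles the AWS Glue ARNs from the first statement of the policy above. The helper name &lt;code&gt;glue_resource_arns&lt;/code&gt; and the sample Region and account ID are illustrative assumptions, not part of any AWS SDK.&lt;/p&gt; 

```python
import json


def glue_resource_arns(region: str, account_id: str) -> list[str]:
    """Build the Glue ARNs listed in the policy above (hypothetical helper)."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit string")
    prefix = f"arn:aws:glue:{region}:{account_id}"
    # Suffixes copied from the Resource list of GlueDataCatalogPermissions.
    suffixes = [
        "catalog",
        "catalog/s3tablescatalog",
        "catalog/s3tablescatalog/*",
        "database/salesdb",
        "database/salesdb/*",
        "database/s3tablescatalog",
        "database/s3tablescatalog/*",
        "table/s3tablescatalog/*",
        "table/*/*",
    ]
    return [f"{prefix}:{suffix}" for suffix in suffixes]


# Example values only; substitute your own Region and account ID.
arns = glue_resource_arns("us-east-1", "111122223333")
print(json.dumps(arns, indent=2))
```

&lt;p&gt;Building every ARN from one prefix keeps the Region and account ID consistent across all nine entries, which is easy to get wrong when you replace the placeholders manually.&lt;/p&gt; 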
&lt;p&gt;Create an&amp;nbsp;&lt;a href="https://docs.aws.amazon.com/redshift/latest/gsg/new-user.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Redshift&lt;/a&gt;&amp;nbsp;provisioned cluster or an&amp;nbsp;&lt;a href="https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-console-workgroups-create-workgroup-wizard.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Redshift Serverless&lt;/a&gt; workgroup, and attach the IAM role created in the previous step.&lt;/p&gt; 
&lt;p&gt;To access the AWS Glue Data Catalog and the resource link, you can now log in to Amazon Redshift as a local user. We use the &lt;strong&gt;admin&lt;/strong&gt; user and Amazon Redshift Query Editor v2.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-589010.jpg"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90892" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-589010.jpg" alt="Amazon Redshift Query Editor v2 interface connected to Serverless workgroup &amp;quot;s3tablesblog&amp;quot; showing 2 native databases and 1 external database with an empty query editor ready for input." width="581" height="238"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;To create the external schema, run the following command. Replace ACCOUNT_ID with your AWS account ID, IAM_ROLE with the name of the IAM role created for schema access, and REGION with your AWS Region.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;CREATE EXTERNAL SCHEMA salesdb
FROM DATA CATALOG DATABASE 'salesdb'
IAM_ROLE 'arn:aws:iam::&amp;lt;ACCOUNT_ID&amp;gt;:role/&amp;lt;IAM_ROLE&amp;gt;'
REGION '&amp;lt;REGION&amp;gt;'
CATALOG_ID '&amp;lt;ACCOUNT_ID&amp;gt;';&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;After you create the external schema, it appears in the left pane under the dev database. The daily_sales table that we created is available, and you can query it directly from Amazon Redshift as a local user.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-589011.jpg"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90893" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-589011.jpg" alt="Amazon Redshift Query Editor v2 showing a SELECT query on the daily_sales table in the salesdb schema with 9 rows of results displaying sale dates, product categories, and sales amounts from January–February 2024." width="1376" height="739"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;h2&gt;Clean up&lt;/h2&gt; 
&lt;p&gt;These cleanup steps permanently delete the data, including the daily_sales table and the sales_mv materialized view. Back up any data that you need to retain before you proceed.&lt;/p&gt; 
&lt;p&gt;To avoid incurring future charges, delete the resources that you created during this walkthrough:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Remove the Glue Data Catalog resources&lt;/li&gt; 
 &lt;li&gt;Delete the&amp;nbsp;table bucket&lt;/li&gt; 
 &lt;li&gt;Delete the Amazon Redshift cluster or Amazon Redshift Serverless workgroup&lt;/li&gt; 
 &lt;li&gt;Terminate the Amazon EMR cluster&lt;/li&gt; 
 &lt;li&gt;Delete the IAM roles and policies that you created&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;Amazon S3 Tables now integrate with the AWS Glue Data Catalog through IAM-based authorization. By consolidating permissions for storage, catalog, and query engines into a single IAM policy, you can streamline authorization with AWS analytics services such as Amazon Athena, Amazon EMR, and AWS Glue, and build your data lake faster while maintaining enterprise-grade security. For organizations with more granular data access requirements, AWS Lake Formation remains available to layer fine-grained access controls on top of this foundation. You can configure the integration through the AWS Management Console, AWS CLI, API, or AWS CloudFormation. With this integration, AWS analytics users can rely on IAM and scale their analytics capabilities with reduced operational complexity.&lt;/p&gt; 
&lt;p&gt;To learn more about S3 Tables and their integration with the AWS Glue Data Catalog, see &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tables-integration-overview.html" target="_blank" rel="noopener noreferrer"&gt;Amazon S3 Tables integration with AWS analytics services overview&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/glue/latest/dg/glue-federation-s3tables.html" target="_blank" rel="noopener noreferrer"&gt;Integrating with Amazon S3 Tables&lt;/a&gt;.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/10/16/rserafim.jpg"&gt;&lt;img loading="lazy" class="wp-image-84745 size-thumbnail alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/10/16/rserafim-100x133.jpg" alt="" width="100" height="133"&gt;&lt;/a&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Ricardo Serafim&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/rcserafim/"&gt;Ricardo&lt;/a&gt; is a Senior Analytics Specialist Solutions Architect at AWS. He has been helping companies with Data Warehouse solutions since 2007.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2024/10/18/Milindo.png"&gt;&lt;img loading="lazy" class="alignleft wp-image-70421 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2024/10/18/Milindo.png" alt="" width="100" height="129"&gt;&lt;/a&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Milind Oke&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/milindoke/"&gt;Milind&lt;/a&gt; is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/11/26/pratdas.jpeg"&gt;&lt;img loading="lazy" class="alignleft wp-image-85700 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/11/26/pratdas.jpeg" alt="" width="100" height="133"&gt;&lt;/a&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Pratik Das&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/das-pratik/"&gt;Pratik&lt;/a&gt; is a Senior Product Manager with AWS Lake Formation. He is passionate about all things data and works with customers to understand their requirements and build delightful experiences. He has a background in building data-driven solutions and machine learning systems.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/11/26/srivipar.jpg"&gt;&lt;img loading="lazy" class="size-full wp-image-85701 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/11/26/srivipar.jpg" alt="" width="100" height="133"&gt;&lt;/a&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Srividya Parthasarathy&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/srividya-parthasarathy-8b71bb32/"&gt;Srividya&lt;/a&gt; is a Senior Big Data Architect on the AWS Lake Formation team. She works with the product team and customers to build robust features and solutions for their analytical data platform. She enjoys building data mesh solutions and sharing them with the community.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Improve DynamoDB analytics with AWS Glue zero-ETL schema and partition controls</title>
		<link>https://aws.amazon.com/blogs/big-data/improve-dynamodb-analytics-with-aws-glue-zero-etl-schema-and-partition-controls/</link>
					
		
		<dc:creator><![CDATA[Raju Ansari]]></dc:creator>
		<pubDate>Mon, 11 May 2026 18:51:22 +0000</pubDate>
				<category><![CDATA[Advanced (300)]]></category>
		<category><![CDATA[Amazon DynamoDB]]></category>
		<category><![CDATA[Amazon SageMaker Lakehouse]]></category>
		<category><![CDATA[AWS Glue]]></category>
		<category><![CDATA[Best Practices]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<category><![CDATA[Data Integrations]]></category>
		<category><![CDATA[Data Lake]]></category>
		<category><![CDATA[DynamoDB]]></category>
		<category><![CDATA[zero-ETL]]></category>
		<guid isPermaLink="false">28a8aea903a5d1a76f281acb8fdf7640a3125781</guid>

					<description>In this post, you learn how to replicate Amazon DynamoDB data to Apache Iceberg tables in Amazon S3 through a zero-ETL integration. We walk through the challenges that the DynamoDB nested, schema-flexible data model introduces for analytics workloads, and show you how to configure schema unnesting and data partitioning for a sample product catalog table. We also cover how to query the replicated data in Amazon Athena using standard SQL.</description>
										<content:encoded>&lt;p&gt;You store transactional data in &lt;a href="https://aws.amazon.com/dynamodb/" target="_blank" rel="noopener noreferrer"&gt;Amazon DynamoDB&lt;/a&gt; and get single-digit millisecond performance. However, when you want to run analytics, machine learning (ML), or reporting on that same data, you face a gap: your flexible, semi-structured DynamoDB schemas don’t align with the flat, columnar formats that analytics engines require. Bridging this gap typically means building and maintaining custom ETL pipelines, which adds development cost and operational overhead.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/zero-etl-using.html" target="_blank" rel="noopener noreferrer"&gt;AWS Glue Zero-ETL&lt;/a&gt; integration removes that pipeline work. It enables replication of your DynamoDB tables to Apache Iceberg tables in &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener noreferrer"&gt;Amazon Simple Storage Service (Amazon S3)&lt;/a&gt;, then query it directly with &lt;a href="https://aws.amazon.com/athena/" target="_blank" rel="noopener noreferrer"&gt;Amazon Athena&lt;/a&gt;. During setup, you can configure two capabilities that will shape how replicated data looks and performs: &lt;strong&gt;schema unnesting&lt;/strong&gt; flattens nested attributes into individual columns, and &lt;strong&gt;data partitioning&lt;/strong&gt; organizes data so your queries scan only what they need.&lt;/p&gt; 
&lt;p&gt;In this post, you learn how to replicate Amazon DynamoDB data to Apache Iceberg tables in Amazon S3 through a zero-ETL integration. We walk through the challenges that the DynamoDB nested, schema-flexible data model introduces for analytics workloads, and show you how to configure schema unnesting and data partitioning for a sample product catalog table. We also cover how to query the replicated data in Amazon Athena using standard SQL.&lt;/p&gt; 
&lt;h2&gt;Semi-structured data meets analytics&lt;/h2&gt; 
&lt;p&gt;Your product catalog in DynamoDB contains items with nested attributes like product details, pricing tiers, and inventory information. A typical item looks like this:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-json"&gt;{
  "product_id": "P-1001",
  "name": "Wireless Headphones",
  "productdetails": {
    "brand": "AudioTech",
    "category": "Electronics",
    "weight_kg": 0.25,
    "specification": {
       "color": "Black",
       "storage": "128GB"
    }
  },
  "pricing": {
    "list_price": 79.99,
    "discount_pct": 10
  },
  "created_at": 1701388800000
}&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;This structure supports fast transactional reads and writes. However, when you replicate this data for analytics, you face two decisions:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;You must decide whether to flatten nested maps like &lt;code&gt;productdetails&lt;/code&gt; into individual columns or preserve them as-is.&lt;/li&gt; 
 &lt;li&gt;You must choose how to organize the data on disk so that queries filtering by brand or date range scan only relevant partitions.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;With AWS Glue Zero-ETL, you address both decisions through configurable schema unnesting and data partitioning.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;You replicate data from your DynamoDB table through AWS Glue Zero-ETL into Apache Iceberg tables stored in Amazon S3, then query the results with Amazon Athena. The following diagram illustrates the end-to-end architecture:&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-1.jpeg"&gt;&lt;img loading="lazy" class="size-full wp-image-90947" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-1.jpeg" alt="Data flow diagram showing AWS data pipeline: DynamoDB source table → AWS Glue zero-ETL integration → Apache Iceberg on Amazon S3 → Amazon Athena analytics query." width="753" height="261"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;AWS Glue zero-ETL ingests data from Amazon DynamoDB, writes it in Apache Iceberg format to your Amazon S3 data lake, and makes it available for SQL queries in Amazon Athena—with no pipelines to build or maintain. With this integration, you:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Save development time&lt;/strong&gt; by skipping custom code and ETL job management&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Keep DynamoDB performance intact&lt;/strong&gt; because replication doesn’t consume the table’s provisioned read/write capacity&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Get data within 15 minutes&lt;/strong&gt; of changes in the source table&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Query with standard tools&lt;/strong&gt; because data lands in Apache Iceberg format, an open table format that AWS natively supports for high-performance analytics&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;During setup, you configure two output settings:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Schema unnesting in Zero-ETL&lt;/strong&gt;: You choose how nested attributes appear in the target. Flattening nested maps into individual columns streamlines your queries and reduces complexity.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Data partitioning in Zero-ETL&lt;/strong&gt;: You choose how data is organized into partitions. When you filter on a partition column, the query engine reads only matching data instead of scanning everything, cutting both query time and cost.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Schema unnesting&lt;/h2&gt; 
&lt;p&gt;When you create a zero-ETL integration, you can choose one of three unnesting options. Schema unnesting transforms complex, nested DynamoDB structures into formats that analytics engines can query directly, removing the need for post-processing transformations.&lt;/p&gt; 
&lt;p&gt;Each option changes how nested DynamoDB attributes appear in the target table. The right choice depends on your analytics tools and how consistent your DynamoDB schemas are.&lt;/p&gt; 
&lt;h3&gt;Option 1: No unnesting&lt;/h3&gt; 
&lt;p&gt;This option preserves the original nested structure. DynamoDB maps and lists remain as structured columns in the target.&lt;/p&gt; 
&lt;p&gt;Using the product example, the target table has two columns: &lt;code&gt;productid&lt;/code&gt;, which holds the DynamoDB partition key, and &lt;code&gt;value&lt;/code&gt;, which holds the DynamoDB record.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Recommended for&lt;/strong&gt;: Workloads where your analytics tools natively support querying nested data and you want to preserve the DynamoDB structure unchanged.&lt;/p&gt; 
&lt;h3&gt;Option 2: Unnest one level&lt;/h3&gt; 
&lt;p&gt;This option flattens top-level maps into individual columns. Lists remain nested.&lt;/p&gt; 
&lt;p&gt;With this option, &lt;code&gt;productdetails&lt;/code&gt; and &lt;code&gt;pricing&lt;/code&gt; each become separate columns.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Recommended for&lt;/strong&gt;: Scenarios where your DynamoDB items have a consistent schema and you want to balance structure preservation with query simplicity.&lt;/p&gt; 
&lt;h3&gt;Option 3: Unnest all levels (default)&lt;/h3&gt; 
&lt;p&gt;This option recursively flattens nested structures using dot notation and produces the flattest schema.&lt;/p&gt; 
&lt;p&gt;For the product table, this creates columns such as &lt;code&gt;productdetails.brand&lt;/code&gt;, &lt;code&gt;productdetails.category&lt;/code&gt;, &lt;code&gt;productdetails.specification.color&lt;/code&gt;, &lt;code&gt;productdetails.specification.storage&lt;/code&gt;, &lt;code&gt;pricing.list_price&lt;/code&gt;, and &lt;code&gt;pricing.discount_pct&lt;/code&gt;. Each column is directly queryable without nested access patterns.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Recommended for&lt;/strong&gt;: Analytics tools that prefer flat schemas when your DynamoDB items have a reasonably consistent structure. Note that deeply nested or highly variable schemas can produce very wide tables.&lt;/p&gt; 
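&lt;p&gt;To make the dot-notation naming concrete, the following Python sketch flattens the sample product item locally. It only illustrates the naming scheme described above and is not the service’s implementation; the function name &lt;code&gt;unnest_all&lt;/code&gt; is hypothetical, and leaving lists untouched is an assumption of this sketch.&lt;/p&gt; 

```python
def unnest_all(item: dict, parent: str = "") -> dict:
    """Recursively flatten nested maps into dot-notation column names."""
    flat = {}
    for key, value in item.items():
        column = f"{parent}.{key}" if parent else key
        if isinstance(value, dict):
            # Descend into nested maps, carrying the path as a prefix.
            flat.update(unnest_all(value, column))
        else:
            # Scalars and lists are kept as-is in this sketch (an assumption).
            flat[column] = value
    return flat


item = {
    "product_id": "P-1001",
    "name": "Wireless Headphones",
    "productdetails": {
        "brand": "AudioTech",
        "category": "Electronics",
        "weight_kg": 0.25,
        "specification": {"color": "Black", "storage": "128GB"},
    },
    "pricing": {"list_price": 79.99, "discount_pct": 10},
    "created_at": 1701388800000,
}

flat = unnest_all(item)
for column, value in flat.items():
    print(column, "=", value)
```

&lt;p&gt;The sample item yields ten flat columns. Every additional nested map or optional attribute adds more, which is why highly variable schemas can lead to very wide tables.&lt;/p&gt; 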
&lt;h2&gt;Data partitioning&lt;/h2&gt; 
&lt;p&gt;You can speed up your queries and reduce costs by partitioning your replicated data. Partitioning divides data into logical segments on disk.&lt;/p&gt; 
&lt;p&gt;When you include a filter on a partition column in your query, the query engine skips irrelevant segments entirely. This behavior is called &lt;em&gt;partition pruning&lt;/em&gt;: instead of scanning the entire dataset, the engine reads only the data that matches your filter conditions. For large tables, partition pruning can reduce both query runtime and cost significantly.&lt;/p&gt; 
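&lt;p&gt;The following Python sketch models partition pruning with an in-memory dictionary: rows are grouped into one segment per brand, and a filtered query opens only the matching segment. The rows, the brand names other than AudioTech, and the function &lt;code&gt;query_by_brand&lt;/code&gt; are made-up illustrations; real engines prune using table metadata rather than a Python dictionary.&lt;/p&gt; 

```python
from collections import defaultdict

# Sample rows (illustrative data only).
rows = [
    {"product_id": "P-1001", "productdetails.brand": "AudioTech"},
    {"product_id": "P-1002", "productdetails.brand": "AudioTech"},
    {"product_id": "P-2001", "productdetails.brand": "SoundMax"},
    {"product_id": "P-3001", "productdetails.brand": "BassLine"},
]

# Write path: group rows into one segment per partition value.
partitions = defaultdict(list)
for row in rows:
    partitions[row["productdetails.brand"]].append(row)


def query_by_brand(brand: str) -> tuple[list, int]:
    """Return the matching rows and the number of rows actually read."""
    segment = partitions.get(brand, [])  # other segments are never opened
    return segment, len(segment)


matches, rows_read = query_by_brand("AudioTech")
print(f"read {rows_read} of {len(rows)} rows")  # read 2 of 4 rows
```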
&lt;h3&gt;Default partitioning&lt;/h3&gt; 
&lt;p&gt;If you don’t specify partition columns, AWS Glue Zero-ETL partitions data using the DynamoDB primary key with bucketing. This approach supports general-purpose queries without requiring manual configuration. For specific query patterns or performance requirements, you can define the custom partitioning strategies described in the following subsections.&lt;/p&gt; 
&lt;h3&gt;Identity partitioning&lt;/h3&gt; 
&lt;p&gt;Identity partitioning uses raw column values to create partitions. You apply this strategy to low-to-medium cardinality columns such as brand, category, or AWS Region. To partition the product table by &lt;code&gt;productdetails.brand&lt;/code&gt; and create a separate partition for each brand, use this configuration:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-json"&gt;{
  "partitionSpec": [
    {
      "fieldName": "productdetails.brand",
      "functionSpec": "identity"
    }
  ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;With this setup, AWS Glue creates one partition directory per unique brand value. When you query for a specific brand, Athena reads only that partition.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Important: &lt;/strong&gt;Avoid identity partitioning on high-cardinality columns such as primary keys or timestamps. This creates many small partitions, which degrades both ingestion and query performance.&lt;/p&gt; 
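&lt;p&gt;One rough way to screen a column before choosing identity partitioning is to compare its number of distinct values with the row count on a data sample. The following Python sketch does that; the helper names and the 10 percent threshold are assumptions chosen for illustration, not values recommended by AWS Glue.&lt;/p&gt; 

```python
def distinct_ratio(rows: list, column: str) -> float:
    """Fraction of rows that carry a distinct value for the column."""
    values = [row[column] for row in rows]
    if not values:
        raise ValueError("need at least one row")
    return len(set(values)) / len(values)


def is_identity_candidate(rows: list, column: str, max_ratio: float = 0.1) -> bool:
    """Heuristic check: few distinct values relative to the row count.

    The default threshold is an arbitrary assumption for this sketch.
    """
    return max_ratio >= distinct_ratio(rows, column)


# Synthetic sample: 1,000 products spread across 5 brands.
sample = [
    {"product_id": f"P-{i:04d}", "brand": f"brand-{i % 5}"}
    for i in range(1000)
]

print(is_identity_candidate(sample, "brand"))       # True: 5 distinct values
print(is_identity_candidate(sample, "product_id"))  # False: every value is unique
```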
&lt;h3&gt;Time-based partitioning&lt;/h3&gt; 
&lt;p&gt;Time-based partitioning organizes data by timestamp at a chosen granularity: &lt;code&gt;year&lt;/code&gt;, &lt;code&gt;month&lt;/code&gt;, &lt;code&gt;day&lt;/code&gt;, or &lt;code&gt;hour&lt;/code&gt;. You apply this strategy to time-series data and time-range queries. To partition the product table by month on the &lt;code&gt;created_at&lt;/code&gt; column, which stores epoch milliseconds, use this configuration:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-json"&gt;{
  "partitionSpec": [
    {
      "fieldName": "created_at",
      "functionSpec": "month",
      "conversionSpec": "epoch_milli"
    }
  ]
}&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;The &lt;code&gt;conversionSpec&lt;/code&gt; parameter tells AWS Glue how to interpret the source timestamp. Supported values: &lt;code&gt;epoch_sec&lt;/code&gt; (Unix seconds), &lt;code&gt;epoch_milli&lt;/code&gt; (Unix milliseconds), and &lt;code&gt;iso&lt;/code&gt; (ISO 8601 format).&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Note: &lt;/strong&gt;The original column values remain unchanged. AWS Glue transforms only the partition column values to timestamp type in the target table.&lt;/p&gt; 
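&lt;p&gt;The following Python sketch shows how a &lt;code&gt;conversionSpec&lt;/code&gt; and a time function combine to produce a partition value, using the &lt;code&gt;created_at&lt;/code&gt; value from the sample item. The function names and the label formats (for example, &lt;code&gt;2023-12&lt;/code&gt;) are assumptions for illustration; the service may represent partition values differently.&lt;/p&gt; 

```python
from datetime import datetime, timezone


def to_utc(value, conversion_spec: str) -> datetime:
    """Interpret a source timestamp according to conversionSpec."""
    if conversion_spec == "epoch_sec":
        return datetime.fromtimestamp(value, tz=timezone.utc)
    if conversion_spec == "epoch_milli":
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    if conversion_spec == "iso":
        parsed = datetime.fromisoformat(str(value).replace("Z", "+00:00"))
        if parsed.tzinfo is None:
            # Assumption: treat timestamps without an offset as UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    raise ValueError(f"unsupported conversionSpec: {conversion_spec}")


# Illustrative label formats, one per time function.
FORMATS = {"year": "%Y", "month": "%Y-%m", "day": "%Y-%m-%d", "hour": "%Y-%m-%d-%H"}


def partition_label(value, function_spec: str, conversion_spec: str) -> str:
    """Return the partition label for a value at the chosen granularity."""
    return to_utc(value, conversion_spec).strftime(FORMATS[function_spec])


print(partition_label(1701388800000, "month", "epoch_milli"))  # 2023-12
# Same instant in seconds, wrongly declared as milliseconds:
print(partition_label(1701388800, "month", "epoch_milli"))     # 1970-01
```

&lt;p&gt;The second call shows what a mismatched &lt;code&gt;conversionSpec&lt;/code&gt; does: a value stored in seconds but declared as milliseconds lands in a January 1970 partition instead of December 2023.&lt;/p&gt; 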
&lt;h3&gt;Multi-level partitioning&lt;/h3&gt; 
&lt;p&gt;You can combine strategies for a hierarchical scheme. To partition first by month and then by brand, use this configuration:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-json"&gt;{
  "partitionSpec": [
    {
      "fieldName": "created_at",
      "functionSpec": "month",
      "conversionSpec": "epoch_milli"
    },
    {
      "fieldName": "productdetails.brand",
      "functionSpec": "identity"
    }
  ]
}
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;This scheme supports efficient queries that filter by date range, brand, or both. Place higher-selectivity columns first in the hierarchy and align the scheme with your most common query patterns.&lt;/p&gt; 
&lt;h2&gt;Best practices&lt;/h2&gt; 
&lt;p&gt;Keep these guidelines in mind when you configure your integration:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Avoid identity partitioning on high-cardinality columns&lt;/strong&gt; such as primary keys, timestamps, or system-generated IDs. This leads to partition explosion and degrades performance.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Apply only one time-based function per column.&lt;/strong&gt; For example, don’t partition the same column by year, month, day, and hour simultaneously.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Match &lt;/strong&gt;&lt;code&gt;conversionSpec&lt;/code&gt;&lt;strong&gt; to your actual data format.&lt;/strong&gt; If your timestamps are in epoch milliseconds, use &lt;code&gt;epoch_milli&lt;/code&gt;, not &lt;code&gt;epoch_sec&lt;/code&gt; or &lt;code&gt;iso&lt;/code&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Choose granularity based on data volume.&lt;/strong&gt; High-volume tables benefit from finer granularity (day or hour). Lower-volume tables work well with coarser granularity (month or year).&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Account for timezone implications with ISO timestamps.&lt;/strong&gt; AWS Glue Zero-ETL normalizes timestamp partition values to UTC.&lt;/li&gt; 
&lt;/ul&gt; 
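&lt;p&gt;Two of these guidelines can be checked mechanically before you create the integration. The following Python sketch inspects a &lt;code&gt;partitionSpec&lt;/code&gt; document for repeated time-based functions on one column and for &lt;code&gt;conversionSpec&lt;/code&gt; values other than the three listed in this post. The function &lt;code&gt;check_partition_spec&lt;/code&gt; is a hypothetical local helper, not an AWS Glue API.&lt;/p&gt; 

```python
TIME_FUNCTIONS = {"year", "month", "day", "hour"}
CONVERSIONS = {"epoch_sec", "epoch_milli", "iso"}


def check_partition_spec(config: dict) -> list[str]:
    """Return a list of problems found in a partitionSpec (empty if none)."""
    problems = []
    time_fields = set()
    for entry in config.get("partitionSpec", []):
        field = entry.get("fieldName")
        function = entry.get("functionSpec")
        conversion = entry.get("conversionSpec")
        if function in TIME_FUNCTIONS:
            # Guideline: only one time-based function per column.
            if field in time_fields:
                problems.append(f"{field}: more than one time-based function")
            time_fields.add(field)
            # Guideline: conversionSpec must be one of the supported values.
            if conversion is not None and conversion not in CONVERSIONS:
                problems.append(f"{field}: unknown conversionSpec {conversion!r}")
    return problems


good = {"partitionSpec": [
    {"fieldName": "created_at", "functionSpec": "month", "conversionSpec": "epoch_milli"},
    {"fieldName": "productdetails.brand", "functionSpec": "identity"},
]}
bad = {"partitionSpec": [
    {"fieldName": "created_at", "functionSpec": "year", "conversionSpec": "epoch_milli"},
    {"fieldName": "created_at", "functionSpec": "month", "conversionSpec": "millis"},
]}

print(check_partition_spec(good))  # []
print(check_partition_spec(bad))
```

&lt;p&gt;The check cannot tell whether a declared &lt;code&gt;conversionSpec&lt;/code&gt; matches the data actually stored in the table, so it is still worth inspecting a few source items by hand.&lt;/p&gt; 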
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;To implement the AWS Glue Zero-ETL integration with a DynamoDB source, you will need:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;An AWS account with IAM permissions that follow the principle of least privilege&lt;/li&gt; 
 &lt;li&gt;An AWS Glue database (for example, &lt;code&gt;ddb_zero_etl_demo_db&lt;/code&gt;) with an Amazon S3 bucket associated as the database location (&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/zero-etl-prerequisites.html#zero-etl-setup-target-resources-glue-database" target="_blank" rel="noopener noreferrer"&gt;setup instructions&lt;/a&gt;)&lt;/li&gt; 
 &lt;li&gt;AWS Glue Data Catalog settings updated with an AWS Identity and Access Management (IAM) policy that grants fine-grained access control for zero-ETL (&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/zero-etl-prerequisites.html#zero-etl-setup-target-resources-glue-database" target="_blank" rel="noopener noreferrer"&gt;setup instructions&lt;/a&gt;)&lt;/li&gt; 
 &lt;li&gt;An IAM role (for example, &lt;code&gt;zetl-role&lt;/code&gt;) that zero-ETL uses to access data from your DynamoDB table&lt;/li&gt; 
 &lt;li&gt;A DynamoDB source table (for example, &lt;code&gt;product&lt;/code&gt;) configured for zero-ETL integration (&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/zero-etl-sources.html#zero-etl-config-source-dynamodb" target="_blank" rel="noopener noreferrer"&gt;setup instructions&lt;/a&gt;)&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Walkthrough: Create the zero-ETL integration&lt;/h2&gt; 
&lt;p&gt;Complete these steps to create a zero-ETL integration with DynamoDB as the source and Apache Iceberg tables in Amazon S3 as the target.&lt;/p&gt; 
&lt;h3&gt;Step 1: Select the source type&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Open the AWS Glue console.&lt;/li&gt; 
 &lt;li&gt;In the navigation pane, under &lt;strong&gt;Data Integration and ETL&lt;/strong&gt;, choose &lt;strong&gt;Zero-ETL integrations&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Create zero-ETL integration&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Select &lt;strong&gt;Amazon DynamoDB&lt;/strong&gt; as the source type, then choose &lt;strong&gt;Next&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-2.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90948" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-2.png" alt="AWS Glue console showing Step 1 of creating a Zero-ETL integration — selecting a source type from 14 available data sources including Amazon DynamoDB, Facebook Ads, Instagram Ads, MySQL, Oracle, PostgreSQL, and Microsoft SQL Server" width="3432" height="1648"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 1: Selecting Amazon DynamoDB as the zero-ETL source type]&lt;/em&gt;&lt;/p&gt; 
&lt;h3&gt;Step 2: Configure source and target&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;In &lt;strong&gt;Source details&lt;/strong&gt;, select your DynamoDB table (for example, &lt;code&gt;product&lt;/code&gt;).&lt;/li&gt; 
 &lt;li&gt;In &lt;strong&gt;Target details&lt;/strong&gt;:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;ul&gt; 
 &lt;li style="list-style-type: none"&gt; 
  &lt;ul&gt; 
   &lt;li&gt;Select the current account as target.&lt;/li&gt; 
   &lt;li&gt;Choose the catalog and target database (for example, &lt;code&gt;ddb_zero_etl_demo_db&lt;/code&gt;).&lt;/li&gt; 
   &lt;li&gt;Select the IAM role (for example, &lt;code&gt;zetl-role&lt;/code&gt;).&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-3.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90949" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-3.png" alt="AWS Glue console Step 2 — configuring source and target for a zero-ETL integration with Amazon DynamoDB &amp;quot;product&amp;quot; table as source and an AWS Glue catalog database &amp;quot;ddb_zero_etl_demo_db&amp;quot; as target" width="3234" height="1622"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 2: Configuring source DynamoDB table and target database]&lt;/em&gt;&lt;/p&gt; 
&lt;h3&gt;Step 3: Configure output settings&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Under &lt;strong&gt;Schema unnesting&lt;/strong&gt;, select &lt;strong&gt;Unnest all fields&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Under &lt;strong&gt;Data partitioning&lt;/strong&gt;, select &lt;strong&gt;Specify custom partition keys&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Enter the partition key (for example, &lt;code&gt;productdetails.brand&lt;/code&gt;) and set the function to &lt;strong&gt;Identity&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Next&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-4.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90950" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-4.png" alt="AWS Glue Zero-ETL integration output settings showing schema unnesting set to &amp;quot;Unnest all fields,&amp;quot; custom partition key &amp;quot;productdetails.brand&amp;quot; configured with Identity function, and target table named &amp;quot;product." width="3404" height="1664"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 3: Configuring schema unnesting and partition key settings]&lt;/em&gt;&lt;/p&gt; 
&lt;h3&gt;Step 4: Set integration details&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Optionally configure encryption and replication settings. The default refresh interval is 15 minutes.&lt;/li&gt; 
 &lt;li&gt;Enter a name for the integration (for example, &lt;code&gt;ddb-zero-etl-demo&lt;/code&gt;).&lt;/li&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Next&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-5.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90951" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-5.png" alt="AWS Glue Zero-ETL integration Step 3 — configuring security with AWS managed KMS key, replication refresh interval set to 15 minutes, and integration named &amp;quot;ddb-zero-etl-demd" width="3422" height="1670"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 4: Configuring encryption and replication settings]&lt;/em&gt;&lt;/p&gt; 
&lt;h3&gt;Step 5: Review and create&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Review your settings and choose &lt;strong&gt;Create and launch integration&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;The integration shows as &lt;strong&gt;Active&lt;/strong&gt; within about a minute.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-6.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90952" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-6.png" alt="AWS Glue Zero-ETL integration Step 4: Review and Create — showing DynamoDB &amp;quot;product&amp;quot; table as source, Glue database &amp;quot;zett_target&amp;quot; as target with IAM role &amp;quot;zett-role,&amp;quot; and partition key &amp;quot;productdetails.brand&amp;quot; with Identity function" width="3440" height="1682"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 5: Review and create summary]&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-7.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90953" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-7.png" alt="AWS Glue Zero-ETL Integration Details page showing &amp;quot;ddb-zero-etl-demo-test&amp;quot; integration with status &amp;quot;Creating,&amp;quot; DynamoDB &amp;quot;product&amp;quot; table as source, Glue database &amp;quot;ddb_zero_etl_demo_db&amp;quot; as target, and a 15-minute refresh interval" width="3428" height="1554"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 6: Integration active with successful status]&lt;/em&gt;&lt;/p&gt; 
&lt;h2&gt;Query the replicated data&lt;/h2&gt; 
&lt;p&gt;After the integration is active and the initial replication completes (typically 15–30 minutes), you can query the data in Amazon Athena.&lt;/p&gt; 
&lt;h3&gt;Preview the replicated data&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Open the Amazon Athena console.&lt;/li&gt; 
 &lt;li&gt;In the query editor, select your target database (for example, &lt;code&gt;ddb_zero_etl_demo_db&lt;/code&gt;).&lt;/li&gt; 
 &lt;li&gt;Run a preview query:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;pre&gt;&lt;code class="lang-sql"&gt;SELECT * FROM "ddb_zero_etl_demo_db"."product"LIMIT 10;&lt;/code&gt;&lt;/pre&gt; 
&lt;h3&gt;Verify schema unnesting&lt;/h3&gt; 
&lt;p&gt;With &lt;strong&gt;Unnest all fields&lt;/strong&gt; selected, nested attributes appear as individual columns with dot notation:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-sql"&gt;SELECT "productdetails.brand", "productdetails.category", "pricing.list_price" 
FROM "ddb_zero_etl_demo_db"."product"
WHERE "productdetails.category" = 'Electronics';&lt;/code&gt;&lt;/pre&gt; 
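&lt;p&gt;To make the column naming concrete, the following Python snippet is a minimal sketch of how nested attributes could map to dot-notation column names. It illustrates the naming convention only and is not the integration's actual implementation; the &lt;code&gt;unnest&lt;/code&gt; helper and the sample item are assumptions made for this example.&lt;/p&gt; 

```python
def unnest(item, prefix=""):
    """Flatten nested dicts into one level with dot-separated column names."""
    columns = {}
    for key, value in item.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested maps, carrying the parent name as a prefix.
            columns.update(unnest(value, name))
        else:
            columns[name] = value
    return columns


# Hypothetical product item, shaped like the table used in this walkthrough.
item = {
    "product_id": "P12345",
    "name": "SmartPhone",
    "productdetails": {"brand": "TechCo", "category": "Electronics"},
    "pricing": {"list_price": 699},
}

flat = unnest(item)
for column in sorted(flat):
    print(column, "=", flat[column])
```

&lt;p&gt;Each nested key becomes its own top-level column, which is why the queries in this section quote names such as &lt;code&gt;"productdetails.brand"&lt;/code&gt;.&lt;/p&gt; 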
&lt;h3&gt;Verify partition pruning&lt;/h3&gt; 
&lt;p&gt;Queries that filter on the partition column (&lt;code&gt;productdetails.brand&lt;/code&gt;) automatically skip irrelevant partitions:&lt;/p&gt; 
&lt;pre&gt;&lt;code class="lang-sql"&gt;SELECT product_id, name, "pricing.list_price"
FROM "ddb_zero_etl_demo_db"."product"
WHERE "productdetails.brand" = 'AudioTech';&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-8.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90954" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-8.png" alt="Amazon Athena Query Editor showing a completed SQL query selecting brand, category, and product ID from a DynamoDB zero-ETL Glue catalog table, returning two results: Samsung SmartPhone P22445 and TechCo SmartPhone P12345" width="3392" height="1622"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 7: Athena query to retrieve the data from the Apache Iceberg lakehouse]&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;You can verify the partition structure by navigating to the Amazon S3 bucket associated with your database. The data is organized into one directory per partition value, as shown in the following figure:&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-9.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90955" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-9.png" alt="Amazon S3 bucket browser showing the &amp;quot;data/&amp;quot; folder in &amp;quot;ddb-zero-etl-demo-bucket&amp;quot; with two partitioned folders: &amp;quot;productdetails.brand=Samsung/&amp;quot; and &amp;quot;productdetails.brand=TechCo/&amp;quot; — confirming Iceberg partition structure from DynamoDB zero-ETL integration" width="3438" height="1404"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;[Figure 8: Amazon S3 bucket organization for the identity partition productdetails.brand]&lt;/em&gt;&lt;/p&gt; 
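&lt;p&gt;The following Python snippet is a rough sketch of why this layout helps with pruning. It groups rows under &lt;code&gt;column=value/&lt;/code&gt; prefixes like the ones in the figure and then reads only the prefix that matches an equality filter. The helper names and sample rows are assumptions for illustration; the actual Iceberg metadata and Athena planning are more involved.&lt;/p&gt; 

```python
def partition_prefix(column, value):
    """Build an identity-partition directory name such as 'col=value/'."""
    return f"{column}={value}/"


def group_by_partition(rows, column):
    """Group rows under the partition prefix derived from the given column."""
    layout = {}
    for row in rows:
        layout.setdefault(partition_prefix(column, row[column]), []).append(row)
    return layout


def prune(layout, column, value):
    """Return only the rows stored under the prefix that matches the filter."""
    return layout.get(partition_prefix(column, value), [])


# Hypothetical rows; the brand values mirror the folders shown in the figure.
rows = [
    {"product_id": "P22445", "productdetails.brand": "Samsung"},
    {"product_id": "P12345", "productdetails.brand": "TechCo"},
]

layout = group_by_partition(rows, "productdetails.brand")
matches = prune(layout, "productdetails.brand", "TechCo")
print(sorted(layout))
print(matches)
```

&lt;p&gt;A filter on the partition column touches a single prefix, whereas a filter on any other column would have to scan every prefix.&lt;/p&gt; 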
&lt;h2&gt;Clean up&lt;/h2&gt; 
&lt;p&gt;To avoid ongoing charges, delete the resources in this order:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Delete the zero-ETL integration.&lt;/strong&gt; In the AWS Glue console, navigate to &lt;strong&gt;Zero-ETL integrations&lt;/strong&gt;, select your integration, and choose &lt;strong&gt;Delete&lt;/strong&gt;. Existing replicated data remains in the target, but new changes stop replicating.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Delete the replicated table.&lt;/strong&gt; In the AWS Glue Data Catalog, navigate to &lt;strong&gt;Tables&lt;/strong&gt;, select the replicated table, and delete it.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Delete the AWS Glue database.&lt;/strong&gt; In the Data Catalog, select the database and delete it.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Delete the Amazon S3 data.&lt;/strong&gt; Empty and delete the S3 bucket associated with the database.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Delete the DynamoDB table.&lt;/strong&gt; If you created it for this walkthrough, delete the source table.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Delete IAM resources.&lt;/strong&gt; Remove the IAM role and policies created for the integration.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;You configured schema unnesting and data partitioning for a DynamoDB zero-ETL integration, replicated a product catalog table to Apache Iceberg tables in Amazon S3, and verified the results in Amazon Athena. Unnesting flattened nested attributes into directly queryable columns. Partitioning helped the query engine skip irrelevant data, reducing both query time and cost. To take your integration further, try monitoring replication lag and data freshness with Amazon CloudWatch metrics. You can also experiment with different partitioning strategies on a staging table before applying them to production workloads, testing time-based partitioning alongside identity partitioning to find the optimal scheme for your query patterns. For broader analytics coverage, query the same Iceberg tables from Amazon Redshift Spectrum or Amazon EMR alongside Athena. For more details, explore these resources:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/zero-etl-using.html" target="_blank" rel="noopener noreferrer"&gt;AWS Glue Zero-ETL integrations&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/glue/latest/dg/zero-etl-monitoring.html" target="_blank" rel="noopener noreferrer"&gt;Monitoring zero-ETL integrations&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/athena/latest/ug/what-is.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Athena documentation&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Introduction.html" target="_blank" rel="noopener noreferrer"&gt;Amazon DynamoDB Developer Guide&lt;/a&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/prescriptive-guidance/latest/apache-iceberg-on-aws/introduction.html" target="_blank" rel="noopener noreferrer"&gt;Apache Iceberg on AWS&lt;/a&gt;&lt;/li&gt; 
&lt;/ul&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-10.png"&gt;&lt;img loading="lazy" class="alignleft size-full wp-image-90956" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-10.png" alt="" width="100" height="132"&gt;&lt;/a&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Raju Ansari&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/rajuansari/"&gt;Raju&lt;/a&gt; is a Senior Software Development Engineer at AWS, specializing in building scalable, secure, serverless solutions that simplify data analytics and AI agent development. He helps organizations modernize their data analytics infrastructure and develop cutting-edge AI agentic applications. Currently, Raju focuses on building foundational AI services, including Amazon Bedrock Agents, which enable developers to create intelligent, autonomous applications at scale. Outside of work, Raju is passionate about giving back to the tech community. He actively volunteers at IEEE events and mentor early and mid-career professionals&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-11.jpeg"&gt;&lt;img loading="lazy" class="alignleft size-full wp-image-90957" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/BDB-5703-image-11.jpeg" alt="" width="100" height="133"&gt;&lt;/a&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Shashank Sharma&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/shashankkumarsharma/"&gt;Shashank&lt;/a&gt; is an Engineering Leader with over 15 years of experience delivering data integration and replication solutions for first-party and third-party databases and SaaS for enterprise customers. He leads engineering for AWS Glue Zero-ETL and Amazon AppFlow, building fully managed pipelines that replicate data from sources like Salesforce, SAP, DynamoDB, and Oracle into Amazon Redshift and Apache Iceberg-based data lakes. Shashank advises startups on technology strategy and mentors engineers and technical leaders at various career stages&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>How to build a cross-Region resilience for Amazon OpenSearch Service with Amazon MSK</title>
		<link>https://aws.amazon.com/blogs/big-data/how-to-build-a-cross-region-resilience-for-amazon-opensearch-service-with-amazon-msk/</link>
					
		
		<dc:creator><![CDATA[Sriharsha Subramanya Begolli]]></dc:creator>
		<pubDate>Mon, 11 May 2026 18:46:43 +0000</pubDate>
				<category><![CDATA[Amazon Managed Streaming for Apache Kafka (Amazon MSK)]]></category>
		<category><![CDATA[Amazon OpenSearch Ingestion]]></category>
		<category><![CDATA[Amazon OpenSearch Serverless]]></category>
		<category><![CDATA[Industries]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">324ae3147cd5c4fdaa603ee2a1e41ea1e7599e94</guid>

					<description>In this post, we outline the solution that provides cross-Region resiliency without needing to reestablish relationships during a fail-back, using an active-active replication model with Amazon OpenSearch Ingestion (OSI) and Amazon Managed Streaming for Apache Kafka (Amazon MSK). This solution applies to both OpenSearch Service managed clusters and Amazon OpenSearch Serverless collections. We use Amazon OpenSearch Serverless as an example for the configurations in this post.</description>
										<content:encoded>&lt;p&gt;Cross-Region resilience for &lt;a href="https://aws.amazon.com/opensearch-service/" target="_blank" rel="noopener noreferrer"&gt;Amazon OpenSearch Service&lt;/a&gt; has historically been a complex challenge, relying on &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/managedomains-snapshot-create.html" target="_blank" rel="noopener noreferrer"&gt;S3-based snapshots&lt;/a&gt; or &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/replication.html" target="_blank" rel="noopener noreferrer"&gt;cross-cluster replication&lt;/a&gt; that demand intricate manual failover procedures often resulting in hours of downtime, data inconsistencies, and significant lag during outages, or other operational disruptions. To overcome these limitations and help businesses stay focused on their core objectives, we’ve developed a solution that automatically maintains synchronized data across AWS Regions while supporting active-active operations in both AWS Regions.&lt;/p&gt; 
&lt;p&gt;AWS offers two &lt;a href="https://opensearch.org/" target="_blank" rel="noopener noreferrer"&gt;OpenSearch&lt;/a&gt; offerings, namely &lt;a href="https://aws.amazon.com/opensearch-service/" target="_blank" rel="noopener noreferrer"&gt;Amazon OpenSearch Service&lt;/a&gt;, a managed cluster-based service where you provision and manage OpenSearch domains (nodes, storage, scaling), and &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless.html" target="_blank" rel="noopener noreferrer"&gt;Amazon OpenSearch Serverless&lt;/a&gt;, a serverless option where AWS automatically manages infrastructure and scaling and you create collections for your search or analytics workloads. OpenSearch Service provides high availability (HA) within an AWS Region through its Multi-AZ deployment model and provides Regional resiliency with &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/replication.html" target="_blank" rel="noopener noreferrer"&gt;cross-cluster replication&lt;/a&gt;. &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/msk-replicator.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Streaming for Apache Kafka (Amazon MSK) Replicator&lt;/a&gt; is an Amazon MSK feature that you can use to reliably replicate data across Amazon MSK clusters in different or the same AWS Region.&lt;/p&gt; 
&lt;p&gt;In this post, we outline the solution that provides cross-Region resiliency without needing to reestablish relationships during a fail-back, using an &lt;a href="https://aws.amazon.com/blogs/architecture/disaster-recovery-dr-architecture-on-aws-part-iv-multi-site-active-active/" target="_blank" rel="noopener noreferrer"&gt;active-active replication model&lt;/a&gt; with &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ingestion.html" target="_blank" rel="noopener noreferrer"&gt;Amazon OpenSearch Ingestion (OSI)&lt;/a&gt; and &lt;a href="https://aws.amazon.com/msk/" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Streaming for Apache Kafka (Amazon MSK).&lt;/a&gt; This solution applies to both OpenSearch Service managed clusters and Amazon OpenSearch Serverless collections. We use Amazon OpenSearch Serverless as an example for the configurations in this post.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Solution overview&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;In this solution, we use Amazon MSK Replicator for bidirectional cross-Region data replication, with OSI pipelines to index data into Amazon OpenSearch Serverless collections in each AWS Region. Although the &lt;a href="https://aws.amazon.com/blogs/big-data/achieve-cross-region-resilience-with-amazon-opensearch-ingestion/" target="_blank" rel="noopener noreferrer"&gt;S3-based approach&lt;/a&gt; also provides cross-Region resilience, Amazon MSK Replicator provides near real-time replication with identical topic naming, which supports active-active operations. Amazon MSK Replicator also provides automatic loop prevention and consumer group offset synchronization, enabling seamless cross-Region failover. You can find the code for the entire solution in the GitHub &lt;a href="https://github.com/aws-samples/sample-opensearch-cross-region-resilience-with-msk/tree/main" target="_blank" rel="noopener noreferrer"&gt;repo&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/bdb-5738-image-1.png" alt="" width="1085" height="599"&gt;&lt;/p&gt; 
&lt;p&gt;The architecture follows a Regional-first approach where data sources write to a local Amazon MSK cluster within their AWS Region. In this sample deployment, an &lt;a href="https://aws.amazon.com/lambda/" target="_blank" rel="noopener noreferrer"&gt;AWS Lambda&lt;/a&gt; function serves as the producer, streaming data into the MSK cluster. OSI pipelines consume the incoming data from the local MSK cluster and persist it to an Amazon OpenSearch Serverless collection within the same AWS Region. To achieve cross-Region data synchronization, Amazon MSK Replicator facilitates bidirectional replication between the Amazon MSK clusters, preserving the same topic names across both environments. This design keeps the Amazon OpenSearch Serverless collections in each AWS Region synchronized with identical datasets, providing low-latency search capabilities and high availability for globally distributed workloads.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Deploy the AWS &lt;a href="https://github.com/aws-samples/sample-opensearch-cross-region-resilience-with-msk/tree/main/cloudformation" target="_blank" rel="noopener noreferrer"&gt;CloudFormation template&lt;/a&gt; to install the prerequisites. The solution has the following prerequisite steps:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Set up &lt;/strong&gt;&lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/what-is-amazon-vpc.html" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;Amazon Virtual Private Cloud (Amazon VPC)&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt; infrastructure in both Regions&lt;/strong&gt; 
  &lt;ol type="a"&gt; 
   &lt;li&gt;Create Amazon VPCs with private subnets in at least two or three Availability Zones for high availability at the AWS Region level&lt;/li&gt; 
   &lt;li&gt;Configure &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/vpc-nat-gateway.html" target="_blank" rel="noopener noreferrer"&gt;Network Address Translation (NAT) Gateways&lt;/a&gt; for outbound internet access from &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/configure-subnets.html" target="_blank" rel="noopener noreferrer"&gt;private subnets&lt;/a&gt;&lt;/li&gt; 
   &lt;li&gt;Use &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/subnet-sizing.html" target="_blank" rel="noopener noreferrer"&gt;non-overlapping CIDR blocks&lt;/a&gt;&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Establish Amazon OpenSearch Serverless collections in both AWS Regions&lt;/strong&gt; 
  &lt;ol type="a"&gt; 
   &lt;li&gt;Create Amazon OpenSearch Serverless collections for log analytics&lt;/li&gt; 
   &lt;li&gt;Configure &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-encryption.html" target="_blank" rel="noopener noreferrer"&gt;encryption&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-network.html" target="_blank" rel="noopener noreferrer"&gt;network&lt;/a&gt;, and &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-data-access.html" target="_blank" rel="noopener noreferrer"&gt;data access policies&lt;/a&gt;&lt;/li&gt; 
   &lt;li&gt;Create &lt;a href="https://docs.aws.amazon.com/vpc/latest/privatelink/create-interface-endpoint.html" target="_blank" rel="noopener noreferrer"&gt;Amazon VPC endpoints&lt;/a&gt; for private access&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Configure MSK clusters in both AWS Regions&lt;/strong&gt; 
  &lt;ol type="a"&gt; 
   &lt;li&gt;Enable &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/security-iam.html" target="_blank" rel="noopener noreferrer"&gt;AWS Identity and Access Management (IAM) authentication (SASL/IAM)&lt;/a&gt;&lt;/li&gt; 
   &lt;li&gt;Enable &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/mvpc-getting-started.html" target="_blank" rel="noopener noreferrer"&gt;Multi-VPC connectivity&lt;/a&gt; (required for Amazon MSK Replicator and OSI)&lt;/li&gt; 
   &lt;li&gt;Configure &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/security_iam_service-with-iam.html" target="_blank" rel="noopener noreferrer"&gt;MSK cluster policies&lt;/a&gt; to allow the &lt;code&gt;kafka.amazonaws.com&lt;/code&gt; and &lt;code&gt;osis-pipelines.amazonaws.com&lt;/code&gt; service principals&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Configure IAM permissions for pipeline and replication access&lt;/strong&gt; 
  &lt;ol type="a"&gt; 
   &lt;li&gt;Create &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/security-iam-ingestion.html" target="_blank" rel="noopener noreferrer"&gt;IAM roles for the OSI pipelines&lt;/a&gt; with permissions to access Amazon Managed Streaming for Apache Kafka and Amazon OpenSearch Serverless&lt;/li&gt; 
   &lt;li&gt;Create &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/security-iam-awsmanpol-AWSMSKReplicatorExecutionRole.html" target="_blank" rel="noopener noreferrer"&gt;IAM roles for the Amazon MSK Replicator&lt;/a&gt; with permissions for cross-Region access to Amazon Managed Streaming for Apache Kafka clusters&lt;/li&gt; 
  &lt;/ol&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;a href="https://github.com/aws-samples/sample-opensearch-cross-region-resilience-with-msk/tree/main/cloudformation" target="_blank" rel="noopener noreferrer"&gt;This AWS CloudFormation&lt;/a&gt; template helps you in deploying all of the required configurations with primary AWS Region as &lt;code&gt;us-east-1&lt;/code&gt; and secondary AWS Region as &lt;code&gt;us-west-2&lt;/code&gt;.&lt;/p&gt; 
&lt;p&gt;The following snippets shows the configuration for the OSI pipeline, which writes data from Amazon MSK to Amazon OpenSearch Serverless. The OSI pipeline uses MSK as a source with IAM authentication.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre class="unlimited-height-code"&gt;&lt;code class="lang-yaml"&gt;version: "2"
kafka-pipeline:
source:
kafka:
acknowledgments: true
topics:
- name: "opensearch-data"
group_id: "osi-consumer-group-primary"
aws:
msk:
arn: "arn:aws:kafka:us-east-1:&amp;lt;aws-acccount-id&amp;gt;:cluster/production-msk-primary/CLUSTER_ID"
region: "us-east-1"
sts_role_arn: "arn:aws:iam::&amp;lt;aws-acccount-id&amp;gt;:role/production-osi-pipeline-primary-role"
sink:
- opensearch:
hosts:
- "https://&amp;lt;OPENSEARCH_SERVERLESS_COLLECTION_ID&amp;gt;.us-east-1.aoss.amazonaws.com"
index: "application-logs-${yyyy.MM.dd}"
aws:
serverless: true
region: "us-east-1"
sts_role_arn: "arn:aws:iam::&amp;lt;aws-acccount-id&amp;gt;:role/production-osi-pipeline-primary-role"
dlq:
s3:
bucket: "production-opensearch-dlq-us-east-1"
region: "us-east-1"
sts_role_arn: "arn:aws:iam::&amp;lt;aws-acccount-id&amp;gt;:role/production-osi-pipeline-primary-role"&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The &lt;a href="https://github.com/aws-samples/sample-opensearch-cross-region-resilience-with-msk/blob/main/docs/iamrole-replicator-config.md" target="_blank" rel="noopener noreferrer"&gt;OSI pipeline IAM Role&lt;/a&gt; has the required permission for Amazon MSK and Amazon OpenSearch Serverless to consume message data from the source and write data to the destination. For true active-active replication, sample deploys &lt;a href="https://github.com/aws-samples/sample-opensearch-cross-region-resilience-with-msk/blob/main/docs/iamrole-replicator-config.md" target="_blank" rel="noopener noreferrer"&gt;two Amazon MSK Replicators&lt;/a&gt; in each AWS Region. Each Amazon MSK cluster requires &lt;a href="https://github.com/aws-samples/sample-opensearch-cross-region-resilience-with-msk/blob/main/docs/iamrole-replicator-config.md" target="_blank" rel="noopener noreferrer"&gt;cluster policy&lt;/a&gt; to allow Amazon MSK Replicator and OSI to connect. To validate the bidirectional replication, the solution uses AWS Lambda functions to produce test messages to both Amazon MSK clusters.&lt;/p&gt; 
&lt;p&gt;When an application generates an event, it first publishes the message to an Apache Kafka topic in the Regional streaming cluster powered by Amazon Managed Streaming for Apache Kafka. In this sample deployment, an AWS Lambda function simulates application activity by producing events into the topic. These events are durably stored in the Apache Kafka partitions, providing a reliable buffer between producers and downstream consumers. An ingestion pipeline built using Amazon OpenSearch Ingestion continuously reads the event stream from the Apache Kafka topic and prepares the data for indexing. The pipeline then indexes the processed events into a collection in Amazon OpenSearch Serverless, making the data searchable in near real time.&lt;/p&gt; 
&lt;p&gt;At the same time, Amazon MSK Replicator replicates the Apache Kafka topic to a peer Amazon MSK cluster in a secondary AWS Region while preserving the topic structure. This makes the same event stream available in the secondary AWS Region without requiring changes to downstream consumers. An OpenSearch Ingestion pipeline in the secondary AWS Region consumes the replicated topic and indexes the events into its local OpenSearch Serverless collection. As events continue to flow through the system, both AWS Regions maintain synchronized datasets that can be queried independently. This architecture enables low-latency Regional search while maintaining a resilient, cross-Region copy of the indexed data.&lt;/p&gt; 
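&lt;p&gt;The following Python snippet is a toy, in-memory model of the bidirectional flow described above. It is a sketch only: the &lt;code&gt;Cluster&lt;/code&gt; class, the &lt;code&gt;replicate&lt;/code&gt; function, and the idea of tagging each record with its origin Region are assumptions made for illustration, and they don't describe how Amazon MSK Replicator implements loop prevention internally. The point is to show why both Regions converge on the same events without records bouncing back and forth.&lt;/p&gt; 

```python
class Cluster:
    """A stand-in for one Regional cluster holding a single topic."""

    def __init__(self, region):
        self.region = region
        self.topic = []  # list of (origin_region, payload) records

    def produce(self, payload):
        # Locally produced records are tagged with this cluster's Region.
        self.topic.append((self.region, payload))


def replicate(source, target):
    """Copy records that originated in the source Region and are new to the target."""
    existing = set(target.topic)
    copied = 0
    for record in source.topic:
        origin, _payload = record
        # Skipping records that did not originate here prevents replication loops.
        if origin == source.region and record not in existing:
            target.topic.append(record)
            copied += 1
    return copied


primary = Cluster("us-east-1")
secondary = Cluster("us-west-2")
primary.produce("event-from-east")
secondary.produce("event-from-west")

# Run several replication rounds in both directions.
for _ in range(3):
    replicate(primary, secondary)
    replicate(secondary, primary)

print(sorted(primary.topic))
print(sorted(secondary.topic))
```

&lt;p&gt;After the first round, further rounds copy nothing, and each simulated cluster holds exactly one copy of both events. An indexing pipeline that reads either topic would therefore see the same dataset.&lt;/p&gt; 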
&lt;h2&gt;&lt;strong&gt;Failover scenario and considerations&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;You can fail over your application to the Amazon OpenSearch Serverless collection in the other AWS Region and continue operations without interruption. The data present before the impairment is available in both collections. Upon recovery, Amazon MSK Replicator and OSI pipelines automatically resume operations without manual intervention. Data that you write to the healthy AWS Region during the impairment is automatically backfilled to the recovered AWS Region. For detailed step-by-step guidance, see the &lt;a href="https://github.com/aws-samples/sample-opensearch-cross-region-resilience-with-msk/blob/main/docs/disaster-recovery-testing.md" target="_blank" rel="noopener noreferrer"&gt;disaster recovery section in the GitHub repo&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;When using Amazon MSK Replicator, be aware that cross-Region data transfer incurs additional costs. To help verify reliability, configure &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/osis-features-overview.html#osis-features-dlq" target="_blank" rel="noopener noreferrer"&gt;Dead Letter Queues (DLQ) for OSI pipelines&lt;/a&gt; to capture failed document ingestion. Additionally, monitor essential &lt;a href="https://aws.amazon.com/cloudwatch/" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch&lt;/a&gt; metrics including ReplicationLatency for tracking lag between clusters, DocumentsFailed for identifying ingestion issues, and MessagesInPerSec for observing message throughput.&lt;/p&gt; 
&lt;p&gt;Persistent buffering in OSI provides a built-in safety net that prevents data loss when data producers send information faster than your OpenSearch cluster can process it, removing the need to provision and manage separate buffering infrastructure. By using managed storage across multiple Availability Zones, this feature enhances data durability while dynamically allocating &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-scaling.html" target="_blank" rel="noopener noreferrer"&gt;OpenSearch Compute Units (OCUs)&lt;/a&gt; for both buffering and data processing, which incurs additional costs. Persistent buffering isn’t enabled by default. Without it, the OSI pipeline relies on an in-memory buffer, which is volatile and has limited capacity for storing incoming data before processing.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;In this post, we showed you how to achieve cross-Region resiliency for Amazon OpenSearch Serverless and OpenSearch Service managed clusters. In our experiments, most writes of a few KB of data completed within one to a few seconds between the two chosen AWS Regions. Replication lag between the AWS Regions depends on the network delay between the chosen Regions and the settings configured on the Amazon OpenSearch Ingestion (OSI) pipeline.&lt;/p&gt; 
&lt;p&gt;Refer to &lt;a href="https://aws.amazon.com/legal/service-level-agreements/" target="_blank" rel="noopener noreferrer"&gt;AWS Service Level Agreements (SLAs)&lt;/a&gt; and &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ingestion.html" target="_blank" rel="noopener noreferrer"&gt;Amazon OpenSearch Ingestion&lt;/a&gt; (OSI) for more details. You can also achieve active-passive replication for OpenSearch using OSI and Amazon Simple Storage Service (Amazon S3), as described in the post &lt;a href="https://aws.amazon.com/blogs/big-data/achieve-cross-region-resilience-with-amazon-opensearch-ingestion/" target="_blank" rel="noopener noreferrer"&gt;Achieve cross-Region resilience with Amazon OpenSearch Ingestion&lt;/a&gt;.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft wp-image-90798 size-thumbnail" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/bdb-5738-image-2-100x98.png" alt="" width="100" height="98"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Sriharsha Subramanya Begolli&lt;/strong&gt; works as a Senior Solutions Architect with AWS, based in Bengaluru, India. His primary focus is assisting large enterprise customers in modernising their applications and developing cloud-based systems to meet their business objectives. His expertise lies in the domains of data, analytics and generative AI.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft wp-image-90799 size-thumbnail" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/bdb-5738-image-3-100x133.png" alt="" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Qais Poonawala&lt;/strong&gt; is a Senior Technical Account Manager at AWS Enterprise Support, India, who specializes in Cloud Operations and Security while helping customers architect highly scalable, resilient, and secure solutions. With extensive experience in enabling enterprise customers across AWS services, he has a passion for solving complex challenges and developing solutions around Security, Cloud Operations, and GenAI.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignleft wp-image-90800 size-thumbnail" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/05/05/bdb-5738-image-4-100x119.png" alt="" width="100" height="119"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;strong&gt;Jay Jothi &lt;/strong&gt;is a Senior Technical Account Manager based in Chennai, India, where he supports major enterprise customers in maximizing the benefits of cloud technology. With extensive experience in the financial services industry and a specialization in Cloud Operations, he focuses on helping financial clients manage data efficiently, derive actionable insights using GenAI, and deliver cost-effective solutions.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>How to consolidate cross-Region S3 data into OpenSearch</title>
		<link>https://aws.amazon.com/blogs/big-data/how-to-consolidate-cross-region-s3-data-into-opensearch/</link>
					
		
		<dc:creator><![CDATA[David Venable]]></dc:creator>
		<pubDate>Fri, 08 May 2026 13:37:47 +0000</pubDate>
				<category><![CDATA[Amazon OpenSearch Ingestion]]></category>
		<category><![CDATA[Amazon OpenSearch Service]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">050f3f71ba06a80f03ca8dc732f00f739ae8f70f</guid>

					<description>We’re happy to announce that Amazon OpenSearch Ingestion pipelines can now read from S3 buckets in different Regions to ingest and consolidate data into a single OpenSearch Service domain or collection. In this post, I'll show you how to use the new cross-Region support to ingest data from S3 buckets across multiple AWS Regions into a single OpenSearch Service domain or collection.</description>
										<content:encoded>&lt;p&gt;You might have data in &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener"&gt;Amazon Simple Storage Service&lt;/a&gt; (Amazon S3) buckets in different AWS Regions that you want available in a single &lt;a href="https://aws.amazon.com/opensearch-service/" target="_blank" rel="noopener"&gt;Amazon OpenSearch Service&lt;/a&gt; domain or collection. Consolidating data across Regions provides unified analytics and search, reduces operational complexity, and streamlines your search infrastructure. We’re happy to announce that Amazon OpenSearch Ingestion pipelines can now read from S3 buckets in different Regions to ingest and consolidate data into a single OpenSearch Service domain or collection.&lt;/p&gt; 
&lt;p&gt;To consolidate this data across AWS Regions, you previously had to provide your own solution. Now Amazon OpenSearch Ingestion can help you accomplish this. In this post, I’ll show you how to use the new cross-Region support to ingest data from S3 buckets across multiple AWS Regions into a single OpenSearch Service domain or collection.&lt;/p&gt; 
&lt;p&gt;Amazon OpenSearch Ingestion (OSI) is a feature-rich data ingestion pipeline that you can use for many different purposes: observability, analytics, and zero-ETL search. Many customers use OpenSearch Ingestion to ingest data from Amazon S3 into OpenSearch Service domains and Amazon OpenSearch Serverless collections. Until now, you could only ingest from a single AWS Region at a time. Now that you can use OpenSearch Ingestion for cross-Region S3 ingestion, I’ll show you how you can use it in two scenarios: batch processing using S3 scan, and streaming ingestion using Amazon Simple Queue Service (Amazon SQS) queues for AWS vended logs like Amazon Virtual Private Cloud (Amazon VPC) Flow Logs and AWS CloudTrail.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Prerequisites&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Complete the following prerequisite steps:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/createupdatedomains.html" target="_blank" rel="noopener noreferrer"&gt;Deploy an OpenSearch Service domain&lt;/a&gt; or &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-collections.html" target="_blank" rel="noopener noreferrer"&gt;OpenSearch Serverless collection&lt;/a&gt; in the Region where you want to perform your search or analytics.&lt;/li&gt; 
 &lt;li&gt;You need S3 buckets in at least two different Regions. You can use existing ones or &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/create-bucket-overview.html" target="_blank" rel="noopener noreferrer"&gt;create S3 buckets&lt;/a&gt;. You can use one in the same AWS Region as your OpenSearch Service domain or collection, or use two completely different Regions.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html" target="_blank" rel="noopener noreferrer"&gt;Upload objects&lt;/a&gt; with data into your S3 buckets. The data can be JSON, ND-JSON, Parquet, CSV, or plaintext formats.&lt;/li&gt; 
 &lt;li&gt;Configure &lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener noreferrer"&gt;AWS Identity and Access Management&lt;/a&gt; (IAM) permissions needed for OSI. For instructions, see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/configure-client-s3.html#s3-source" target="_blank" rel="noopener noreferrer"&gt;Amazon S3 as a source&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;For cross-Region ingestion, you must now also include the s3:GetBucketLocation permission. This gives the pipeline the ability to determine which AWS Region the bucket is located in.&lt;/li&gt; 
&lt;/ol&gt; 
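&lt;p&gt;As a minimal sketch of the permissions in steps 4 and 5, the following Python builds an IAM policy document for a pipeline role. The helper name &lt;code&gt;build_pipeline_s3_policy&lt;/code&gt; is hypothetical, and including &lt;code&gt;s3:ListBucket&lt;/code&gt; assumes you plan to use S3 scan. Treat the linked documentation as the authoritative list of actions for your pipeline.&lt;/p&gt; 

```python
import json


def build_pipeline_s3_policy(bucket_names):
    """Return an IAM policy document (as a dict) for reading the given buckets.

    Hypothetical helper for illustration; adjust the actions to your pipeline.
    """
    bucket_arns = ["arn:aws:s3:::" + name for name in bucket_names]
    object_arns = [arn + "/*" for arn in bucket_arns]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "BucketLevelAccess",
                "Effect": "Allow",
                # s3:GetBucketLocation lets the pipeline discover each bucket's Region.
                # s3:ListBucket is an assumption for S3 scan pipelines.
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": bucket_arns,
            },
            {
                "Sid": "ObjectReadAccess",
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": object_arns,
            },
        ],
    }


policy = build_pipeline_s3_policy(["amzn-s3-demo-bucket1", "amzn-s3-demo-bucket2"])
print(json.dumps(policy, indent=2))
```

&lt;p&gt;Bucket-level actions apply to the bucket ARN, while &lt;code&gt;s3:GetObject&lt;/code&gt; applies to the objects under it, which is why the sketch uses two statements.&lt;/p&gt; 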
&lt;p&gt;After you complete these steps, you can set up your Amazon OpenSearch Ingestion pipelines for either batch or streaming scenarios. In the following sections, I’ll give you recommendations on when to choose each approach and outline the steps for creating your pipeline.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Batch scenarios&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;You can use the OpenSearch Ingestion S3 scan capability to read batch data from S3. You might find this approach useful when your data is written to S3 on a schedule. To perform a cross-Region S3 scan, you only specify the buckets that you’re reading from when you create the OpenSearch Ingestion pipeline.&lt;/p&gt; 
&lt;p&gt;The following diagram shows the design for an OpenSearch Ingestion pipeline in &lt;code&gt;us-west-2&lt;/code&gt; reading from S3 buckets in &lt;code&gt;us-east-1&lt;/code&gt; and &lt;code&gt;eu-west-1&lt;/code&gt; and writing that data into an OpenSearch Service domain in &lt;code&gt;us-west-2&lt;/code&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-90616 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/25/BDB-5804-image-1.jpg" alt="" width="701" height="421"&gt;&lt;/p&gt; 
&lt;p&gt;Next, you will &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/creating-pipeline.html" target="_blank" rel="noopener noreferrer"&gt;create an OpenSearch Ingestion pipeline&lt;/a&gt;. You must create this pipeline in the same Region as your OpenSearch Service domain or collection.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;version: "2"
s3-scan-cross-region:
  source:
    s3:
      compression: automatic
      codec:
        json:
      scan:
        buckets:
          - bucket:
              name: amzn-s3-demo-bucket1
          - bucket:
              name: amzn-s3-demo-bucket2
      aws:
        region: us-west-2

  sink:
    - opensearch:
        hosts: [ "https://search-mydomain-abcdefghijklmn.us-west-2.es.amazonaws.com" ]
        index: s3_scan_cross_region
        aws:
          region: us-west-2
&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The previous pipeline configuration uses the JSON codec. You might want to &lt;a href="https://docs.opensearch.org/latest/data-prepper/pipelines/configuration/sources/s3/#codec" target="_blank" rel="noopener noreferrer"&gt;configure a different codec&lt;/a&gt; if your data isn’t a large JSON object.&lt;/p&gt; 
&lt;p&gt;You can now query your OpenSearch Service domain or collection to see the data that you ingested.&lt;/p&gt; 
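&lt;p&gt;If you read from many buckets, you might prefer to generate the scan configuration rather than edit it by hand. The following Python is a minimal sketch that renders the same pipeline layout shown earlier for any list of buckets. The function name &lt;code&gt;render_scan_pipeline&lt;/code&gt; and its parameters are hypothetical, and the output is plain text that you would still review before creating a pipeline.&lt;/p&gt; 

```python
def render_scan_pipeline(bucket_names, pipeline_region, host, index):
    """Render an S3 scan pipeline configuration as YAML text.

    Hypothetical helper: it mirrors the layout of the example configuration
    and only varies the bucket list, Region, host, and index.
    """
    lines = [
        'version: "2"',
        "s3-scan-cross-region:",
        "  source:",
        "    s3:",
        "      compression: automatic",
        "      codec:",
        "        json:",
        "      scan:",
        "        buckets:",
    ]
    # One list entry per bucket; the buckets can be in different Regions.
    for name in bucket_names:
        lines.append("          - bucket:")
        lines.append("              name: " + name)
    lines += [
        "      aws:",
        "        region: " + pipeline_region,
        "",
        "  sink:",
        "    - opensearch:",
        '        hosts: [ "' + host + '" ]',
        "        index: " + index,
        "        aws:",
        "          region: " + pipeline_region,
    ]
    return "\n".join(lines) + "\n"


print(render_scan_pipeline(
    ["amzn-s3-demo-bucket1", "amzn-s3-demo-bucket2"],
    "us-west-2",
    "https://search-mydomain-abcdefghijklmn.us-west-2.es.amazonaws.com",
    "s3_scan_cross_region",
))
```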
&lt;h2&gt;&lt;strong&gt;Streaming scenarios: AWS vended logs&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Like many of our customers, you might want to ingest S3 data from different AWS Regions into OpenSearch Service. A common reason is to consolidate AWS vended logs, such as VPC Flow Logs, CloudTrail data, and load balancer logs. For these scenarios, you can configure OpenSearch Ingestion pipelines to read from an Amazon SQS queue to stream data into your OpenSearch Service domain or collection.&lt;/p&gt; 
&lt;p&gt;AWS vended logs are written to Amazon S3 in the same AWS Region as the service that generates them. For example, VPC Flow Logs will be in the same AWS Region as your Amazon VPC. You can use OpenSearch Ingestion to consolidate these logs into one AWS Region. In the VPC Flow Logs example, you can consolidate your VPC Flow Logs from multiple AWS Regions into a single OpenSearch Service domain or collection to analyze network patterns from your different Amazon VPCs.&lt;/p&gt; 
&lt;p&gt;The following diagram outlines the overall setup. It shows an example of sending AWS vended logs from &lt;code&gt;us-east-1&lt;/code&gt; and &lt;code&gt;eu-west-1&lt;/code&gt; to an OpenSearch Service domain in &lt;code&gt;us-west-2&lt;/code&gt;. You can change the AWS Regions depending on your specific needs.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-90617 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/25/BDB-5804-image-2.jpg" alt="" width="1001" height="421"&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;You must configure your vended logs to write log events to Amazon S3 buckets in their respective AWS Regions. Using VPC Flow Logs as our example, you can &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/flow-logs.html" target="_blank" rel="noopener noreferrer"&gt;configure VPC Flow Logs for your VPCs&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/creating-sqs-standard-queues.html" target="_blank" rel="noopener noreferrer"&gt;Create an Amazon SQS queue&lt;/a&gt; in the same AWS Region as your OpenSearch Service domain.&lt;/li&gt; 
 &lt;li&gt;Amazon S3 doesn’t send notifications to cross-Region Amazon SQS queues, so you will use intermediate Amazon Simple Notification Service (Amazon SNS) topics to consolidate the notifications from multiple Regions into one queue. For each S3 bucket, &lt;a href="https://docs.aws.amazon.com/sns/latest/dg/sns-create-topic.html" target="_blank" rel="noopener noreferrer"&gt;create an SNS topic&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/EventNotifications.html" target="_blank" rel="noopener noreferrer"&gt;Configure S3 Event Notifications for SNS&lt;/a&gt;. You will do this for each S3 bucket and each SNS topic.&lt;/li&gt; 
 &lt;li&gt;SNS can send cross-Region notifications to SQS. &lt;a href="https://docs.aws.amazon.com/sns/latest/dg/sns-create-subscribe-endpoint-to-topic.html" target="_blank" rel="noopener noreferrer"&gt;Create a subscription&lt;/a&gt; from each SNS topic that you created in step 3 to the single SQS queue you created in step 2.&lt;/li&gt; 
 &lt;li&gt;Configure your pipeline role to read from SQS and read from the relevant S3 buckets.&lt;/li&gt; 
&lt;/ol&gt; 
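&lt;p&gt;Steps 3 through 5 depend on two resource policies: each SNS topic must allow Amazon S3 to publish to it, and the SQS queue must allow the SNS topics to send messages to it. The following Python is a minimal sketch of both policy documents. The function names, topic names, and account ID are placeholders for illustration, not values from this post.&lt;/p&gt; 

```python
import json


def sns_topic_policy_for_s3(topic_arn, bucket_name, account_id):
    """Policy for one SNS topic: let one S3 bucket publish event notifications."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowS3Publish",
            "Effect": "Allow",
            "Principal": {"Service": "s3.amazonaws.com"},
            "Action": "sns:Publish",
            "Resource": topic_arn,
            "Condition": {
                # Restrict publishing to the expected bucket and account.
                "ArnLike": {"aws:SourceArn": "arn:aws:s3:::" + bucket_name},
                "StringEquals": {"aws:SourceAccount": account_id},
            },
        }],
    }


def sqs_queue_policy_for_sns(queue_arn, topic_arns):
    """Policy for the single SQS queue: accept messages from the listed topics."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Sid": "AllowSnsSend",
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            # Topics can be in Regions other than the queue's Region.
            "Condition": {"ArnEquals": {"aws:SourceArn": topic_arns}},
        }],
    }


# Placeholder ARNs for illustration only.
topics = [
    "arn:aws:sns:us-east-1:123456789012:example-topic-us-east-1",
    "arn:aws:sns:eu-west-1:123456789012:example-topic-eu-west-1",
]
queue_policy = sqs_queue_policy_for_sns(
    "arn:aws:sqs:us-west-2:123456789012:amzn-s3-demo-all-regions", topics
)
topic_policy = sns_topic_policy_for_s3(topics[0], "amzn-s3-demo-bucket1", "123456789012")
print(json.dumps(queue_policy, indent=2))
```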
&lt;p&gt;Now &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/creating-pipeline.html" target="_blank" rel="noopener noreferrer"&gt;create an OpenSearch Ingestion pipeline&lt;/a&gt; in the same AWS Region as your OpenSearch Service domain.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;version: "2"
s3-sqs-cross-region:
  source:
    s3:
      notification_type: sqs
      codec:
        newline:
      sqs:
        queue_url: https://sqs.us-west-2.amazonaws.com/123456789012/amzn-s3-demo-all-regions
      aws:
        region: us-west-2

  sink:
    - opensearch:
        hosts: [ "https://search-mydomain-abcdefghijklmn.us-west-2.es.amazonaws.com" ]
        index: s3_sqs_cross_region
        aws:
          region: us-west-2
&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The previous pipeline configuration uses the newline codec, which reads each line of an object as a separate event. You might want to &lt;a href="https://docs.opensearch.org/latest/data-prepper/pipelines/configuration/sources/s3/#codec" target="_blank" rel="noopener noreferrer"&gt;configure a different codec&lt;/a&gt; if your data is not line-delimited.&lt;/p&gt; 
&lt;p&gt;Next, &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/upload-objects.html" target="_blank" rel="noopener noreferrer"&gt;upload objects&lt;/a&gt; with data into your S3 buckets. When you upload data, S3 sends notifications to SNS, which forwards them to the SQS queue.&lt;/p&gt; 
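&lt;p&gt;To make the notification path concrete, the following Python is a minimal sketch of what a message on the queue looks like after it passes through SNS, assuming raw message delivery is not enabled on the subscription. The pipeline reads these messages for you, so this is only an illustration of the message shape. The function name &lt;code&gt;s3_objects_from_sqs_body&lt;/code&gt; and the sample object key are hypothetical.&lt;/p&gt; 

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_sqs_body(body):
    """Extract Region, bucket, and key from an SQS message body.

    Handles both a direct S3 event and an S3 event wrapped in an SNS envelope.
    """
    outer = json.loads(body)
    # With SNS in the middle, the S3 event is a JSON string in the "Message" field.
    if outer.get("Type") == "Notification" and "Message" in outer:
        event = json.loads(outer["Message"])
    else:
        event = outer
    found = []
    for record in event.get("Records", []):
        found.append({
            "region": record["awsRegion"],
            "bucket": record["s3"]["bucket"]["name"],
            # Object keys in S3 event notifications are URL-encoded.
            "key": unquote_plus(record["s3"]["object"]["key"]),
        })
    return found


# Sample event for illustration; real events carry more fields.
s3_event = {
    "Records": [{
        "eventName": "ObjectCreated:Put",
        "awsRegion": "eu-west-1",
        "s3": {
            "bucket": {"name": "amzn-s3-demo-bucket2"},
            "object": {"key": "logs/2026/flow+log+1.gz"},
        },
    }]
}
sqs_body = json.dumps({"Type": "Notification", "Message": json.dumps(s3_event)})
print(s3_objects_from_sqs_body(sqs_body))
```

&lt;p&gt;Each record names the bucket and its Region, so one queue in one Region can describe objects in many Regions.&lt;/p&gt; 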
&lt;p&gt;You can now query your OpenSearch Service domain or collection to see the data that you ingested.&lt;/p&gt; 
&lt;p&gt;Here is what makes this possible and what is different. The &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/configure-client-s3.html" target="_blank" rel="noopener noreferrer"&gt;SQS queue receives the event notifications&lt;/a&gt; for the buckets. Before the cross-Region feature of OpenSearch Ingestion, the pipeline could see these events, but couldn’t access the S3 bucket even if the permissions were granted. Now, the pipeline determines the AWS Region that the bucket is in and obtains an AWS Security Token Service (AWS STS) token for that Region. Using the STS token from the same Region as the S3 bucket allows the pipeline to read the data.&lt;/p&gt; 
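&lt;p&gt;The Region lookup has two quirks that are easy to miss if you reproduce it yourself. The following Python is a minimal sketch, not the pipeline’s actual implementation: it maps the &lt;code&gt;LocationConstraint&lt;/code&gt; value returned by the S3 &lt;code&gt;GetBucketLocation&lt;/code&gt; API to a Region name and builds the matching regional STS endpoint. The function names are hypothetical, and the endpoint format assumes the standard &lt;code&gt;aws&lt;/code&gt; partition.&lt;/p&gt; 

```python
def region_from_location_constraint(location_constraint):
    """Map a GetBucketLocation LocationConstraint value to a Region name."""
    # Buckets in us-east-1 are reported with an empty LocationConstraint.
    if not location_constraint:
        return "us-east-1"
    # "EU" is a legacy value that refers to eu-west-1.
    if location_constraint == "EU":
        return "eu-west-1"
    return location_constraint


def regional_sts_endpoint(region):
    """Return the regional STS endpoint (standard aws partition assumed)."""
    return "https://sts." + region + ".amazonaws.com"


for constraint in (None, "EU", "eu-west-1"):
    region = region_from_location_constraint(constraint)
    print(constraint, region, regional_sts_endpoint(region))
```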
&lt;h2&gt;&lt;strong&gt;Using the AWS Console&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;When you create the pipeline using the OpenSearch Ingestion console, you will have options to select a blueprint for your use case. These blueprints help you create pipelines for various vended log types by selecting only your SQS queue and OpenSearch domain. The blueprint handles the data type mappings for you by including appropriate processors. You can use these blueprints as a starting point and modify your processors for your specific requirements.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Clean up resources&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;When you’re done testing, use the following steps to delete the resources that you created.&lt;/p&gt; 
&lt;p&gt;If you set up a batch pipeline:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/delete-pipeline.html" target="_blank" rel="noopener noreferrer"&gt;Delete&lt;/a&gt; the OpenSearch Ingestion pipeline.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;If you set up a streaming pipeline:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/delete-pipeline.html" target="_blank" rel="noopener noreferrer"&gt;Delete&lt;/a&gt; the OpenSearch Ingestion pipeline.&lt;/li&gt; 
 &lt;li&gt;If you created an SQS queue, &lt;a href="https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/step-delete-queue.html" target="_blank" rel="noopener noreferrer"&gt;delete the SQS queue&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;If you created SNS topics, &lt;a href="https://docs.aws.amazon.com/sns/latest/dg/sns-delete-subscription-topic.html" target="_blank" rel="noopener noreferrer"&gt;delete the SNS topics&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;If you configured AWS vended logs, you can delete those logging configurations. This example used VPC Flow Logs. For instructions on how to do so, see &lt;a href="https://docs.aws.amazon.com/vpc/latest/userguide/working-with-flow-logs.html#delete-flow-log" target="_blank" rel="noopener noreferrer"&gt;Delete the Flow Logs&lt;/a&gt;.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;For both pipelines, these steps help you delete the common resources.&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_manage_delete.html" target="_blank" rel="noopener noreferrer"&gt;Delete the IAM roles&lt;/a&gt; that you created for your pipeline.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/DeletingObjects.html" target="_blank" rel="noopener noreferrer"&gt;Delete the S3 objects&lt;/a&gt; that you uploaded and the &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/delete-bucket.html" target="_blank" rel="noopener noreferrer"&gt;S3 bucket&lt;/a&gt;.&lt;/li&gt; 
 &lt;li&gt;Delete the &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/gsgdeleting.html" target="_blank" rel="noopener noreferrer"&gt;Amazon OpenSearch domain&lt;/a&gt; or the &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/serverless-delete.html" target="_blank" rel="noopener noreferrer"&gt;Amazon OpenSearch Serverless collection&lt;/a&gt;.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;In this post, I showed you how you can use Amazon OpenSearch Ingestion to ingest data from Amazon S3 buckets in different AWS Regions. I showed that this works for both batch scan and streaming scenarios. The feature offers you a straightforward way to consolidate your data from other Regions into one OpenSearch Service domain or collection.&lt;/p&gt; 
&lt;p&gt;To get started with the cross-Region S3 source, refer to the &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/ingestion.html" target="_blank" rel="noopener noreferrer"&gt;OpenSearch Ingestion documentation&lt;/a&gt; or try creating a pipeline from one of our blueprints using the OpenSearch Ingestion console. You can &lt;a href="https://docs.opensearch.org/latest/data-prepper/common-use-cases/codec-processor-combinations/" target="_blank" rel="noopener noreferrer"&gt;read about the codecs&lt;/a&gt; that OpenSearch Ingestion offers for parsing your S3 objects. You can also learn about the &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/pipeline-config-reference.html" target="_blank" rel="noopener noreferrer"&gt;various processors&lt;/a&gt; that OpenSearch Ingestion offers, so you can transform and enrich your data to meet your needs.&lt;/p&gt; 
&lt;p&gt;You can also use OpenSearch Ingestion for combined cross-Region and cross-account ingestion. To do this, you must grant &lt;a href="https://docs.aws.amazon.com/AmazonS3/latest/userguide/example-walkthroughs-managing-access-example2.html" target="_blank" rel="noopener noreferrer"&gt;cross-account permissions&lt;/a&gt; on your S3 bucket. You must also make some changes to your &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/configure-client-s3.html#fdsf" target="_blank" rel="noopener noreferrer"&gt;pipeline configuration&lt;/a&gt;. Combining what I showed you in this post with the existing cross-account features greatly expands your ingestion options.&lt;/p&gt; 
&lt;p&gt;If you’re ready to take your streaming ingestion analytics to the next level, you can read about how to &lt;a href="https://docs.opensearch.org/latest/data-prepper/common-use-cases/metrics-logs/" target="_blank" rel="noopener noreferrer"&gt;generate metrics from logs&lt;/a&gt; and even how to send those derived metrics to &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/configure-client-prometheus.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Service for Prometheus&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;Have you tried out the cross-Region capabilities of OpenSearch Ingestion? Share your use cases and questions in the comments.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h3&gt;About the authors&lt;/h3&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone wp-image-90515 size-thumbnail" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/21/BDB-5804-image-3-100x100.jpeg" alt="" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/venabledavid/" target="_blank" rel="noopener noreferrer"&gt;David&lt;/a&gt; is a senior software engineer working on observability in OpenSearch at Amazon Web Services. He is a maintainer on the Data Prepper project.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Enable real-time mainframe analytics with Precisely Connect and Amazon S3</title>
		<link>https://aws.amazon.com/blogs/big-data/enable-real-time-mainframe-analytics-with-precisely-connect-and-amazon-s3/</link>
					
		
		<dc:creator><![CDATA[Supreet Padhi, Rochelle Grubbs]]></dc:creator>
		<pubDate>Fri, 08 May 2026 13:29:29 +0000</pubDate>
				<category><![CDATA[Amazon S3 Tables]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Customer Solutions]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<category><![CDATA[Partner solutions]]></category>
		<guid isPermaLink="false">84b11f9d80f5065fddba0ee94737c75e53799c8a</guid>

					<description>In this post, we discuss how you can use Precisely Connect to enable real-time, direct replication of mainframe data to Amazon Simple Storage Service (Amazon S3), and how your organization can extend this foundation using Amazon S3 Tables for advanced analytics.</description>
										<content:encoded>&lt;p&gt;&lt;em&gt;This is a guest post by Supreet Padhi, Technology Architect, Strategic Technologies, and Rochelle Grubbs, Senior Director, Solution Architect at Precisely in partnership with AWS.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;Business leaders face a critical challenge to enable real-time analytics. Their most valuable data sits in mainframe systems that reliably process billions of transactions daily, but extracting value for modern analytics and AI remains complex and costly. Traditional mainframe-to-cloud integration approaches require multi-step replication with intermediary systems, creating operational overhead, latency, and data integrity risks. This complexity delays insights, increases infrastructure costs, limits agility, and blocks organizations from using AI and machine learning on their mainframe data.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://www.precisely.com/" target="_blank" rel="noopener noreferrer"&gt;Precisely&lt;/a&gt;, a global leader in data integrity with over 12,000 customers including 95 of the Fortune 100, has &lt;a href="https://www.precisely.com/press-release/precisely-accelerates-mainframe-modernization-with-real-time-data-replication-to-amazon-s3/" target="_blank" rel="noopener noreferrer"&gt;announced&lt;/a&gt; an expansion of its collaboration with AWS through new enhancements to Precisely Connect. Precisely is an &lt;a href="https://partners.amazonaws.com/partners/001E000000fgBJWIA2/Precisely" target="_blank" rel="noopener noreferrer"&gt;AWS Data and Analytics ISV Competency and AWS Migration and Modernization ISV Competency&lt;/a&gt; partner. Precisely has service specializations in &lt;a href="https://aws.amazon.com/redshift/" target="_blank" rel="noopener noreferrer"&gt;Amazon Redshift&lt;/a&gt; and &lt;a href="https://aws.amazon.com/rds/" target="_blank" rel="noopener noreferrer"&gt;Amazon Relational Database Service (Amazon RDS)&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;In &lt;a href="https://aws.amazon.com/blogs/big-data/stream-mainframe-data-to-aws-in-near-real-time-with-precisely-and-amazon-msk/" target="_blank" rel="noopener noreferrer"&gt;Stream mainframe data to AWS in near-real time with Precisely and Amazon MSK&lt;/a&gt;, we showed you how to set up mainframe CDC and the AWS Mainframe Modernization – Data Replication for IBM z/OS Amazon Machine Image (AMI) available in &lt;a href="https://aws.amazon.com/marketplace/pp/prodview-sjnpyprulmohs" target="_blank" rel="noopener noreferrer"&gt;AWS Marketplace&lt;/a&gt;. In this post, we discuss how you can use Precisely Connect to enable real-time, direct replication of mainframe data to &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener noreferrer"&gt;Amazon Simple Storage Service (Amazon S3)&lt;/a&gt;, and how your organization can extend this foundation using &lt;a href="https://aws.amazon.com/s3/features/tables/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3 Tables&lt;/a&gt; for advanced analytics.&lt;/p&gt; 
&lt;h2&gt;Real-time mainframe data access&lt;/h2&gt; 
&lt;p&gt;Organizations that can connect their mainframe environments with modern cloud platforms can gain advantages through improved agility, reduced operational costs, and enhanced analytics capabilities. For example, moving appropriate analytics and reporting workloads to the cloud can significantly reduce mainframe operational costs while maintaining performance and reliability. Real-time data access makes insights available within seconds rather than waiting for batch processing cycles, enabling faster responses to market changes and customer needs. Eliminating bulk data extracts and intermediary systems also reduces infrastructure and maintenance expenses. This frees IT resources to focus on higher-value initiatives.&lt;/p&gt; 
&lt;p&gt;However, implementing mainframe-to-cloud integrations presents unique technical challenges that require specialized solutions. These include converting mainframe character encoding (EBCDIC) to standard ASCII format and handling mainframe-specific data types such as packed decimal (COMP-3) fields. You also need to manage the complexity of VSAM (Virtual Storage Access Method) files that can store multiple record types in a single file, and maintain real-time synchronization without impacting mainframe performance.&lt;/p&gt; 
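&lt;p&gt;To illustrate two of these conversions, the following Python is a minimal sketch that decodes EBCDIC text and unpacks a packed decimal field (COMP-3 in COBOL terms). It is not how Precisely Connect implements the conversion. The function names are hypothetical, and the choice of code page &lt;code&gt;cp037&lt;/code&gt; is an assumption, because mainframe systems use several EBCDIC code pages.&lt;/p&gt; 

```python
from decimal import Decimal


def decode_ebcdic(raw):
    """Decode EBCDIC bytes to text. cp037 (US/Canada) is an assumed code page."""
    return raw.decode("cp037")


def unpack_comp3(raw, scale=0):
    """Unpack a packed decimal field into a Decimal.

    Each byte holds two decimal digits, except the last byte, which holds
    one digit followed by a sign nibble. `scale` is the number of implied
    decimal places, which comes from the record layout, not from the data.
    """
    if not raw:
        raise ValueError("packed decimal field is empty")
    digits = []
    for byte in raw[:-1]:
        digits.append(byte // 16)   # high nibble
        digits.append(byte % 16)    # low nibble
    digits.append(raw[-1] // 16)    # final digit
    sign_nibble = raw[-1] % 16
    if any(d not in range(10) for d in digits):
        raise ValueError("invalid digit nibble in packed decimal")
    value = int("".join(str(d) for d in digits))
    # 0xD (and the alternate 0xB) mark a negative value; other sign nibbles are positive.
    if sign_nibble in (0x0B, 0x0D):
        value = -value
    return Decimal(value).scaleb(-scale)


print(decode_ebcdic(bytes([0xC8, 0xC5, 0xD3, 0xD3, 0xD6])))
print(unpack_comp3(bytes([0x12, 0x34, 0x5C]), scale=2))
print(unpack_comp3(bytes([0x12, 0x3D])))
```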
&lt;p&gt;Change Data Capture (CDC) technology addresses these challenges through incremental data movement that eliminates disruptive bulk extracts by streaming only changed data to cloud targets, minimizing system impact and ensuring data currency. Real-time synchronization keeps cloud applications in sync with mainframe systems, enabling immediate insights and responsive operations.&lt;/p&gt; 
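&lt;p&gt;The idea of moving only changed data can be sketched in a few lines. The following Python applies a stream of change events to a keyed copy of a table, so the copy stays current without a bulk extract. The event shape (&lt;code&gt;op&lt;/code&gt;, &lt;code&gt;key&lt;/code&gt;, &lt;code&gt;row&lt;/code&gt;) and the function name are hypothetical and are not the format that Precisely Connect emits.&lt;/p&gt; 

```python
def apply_changes(table, events):
    """Apply insert, update, and delete events to a dict keyed by primary key."""
    for event in events:
        op = event["op"]
        key = event["key"]
        if op in ("insert", "update"):
            # Inserts and updates both carry the full current row in this sketch.
            table[key] = event["row"]
        elif op == "delete":
            table.pop(key, None)
        else:
            raise ValueError("unknown operation: " + op)
    return table


# Hypothetical account table and change stream.
accounts = {1001: {"balance": "250.00"}}
changes = [
    {"op": "update", "key": 1001, "row": {"balance": "175.00"}},
    {"op": "insert", "key": 1002, "row": {"balance": "40.00"}},
    {"op": "delete", "key": 1001},
]
print(apply_changes(accounts, changes))
```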
&lt;h2&gt;Precisely Connect: Real-time data replication to Amazon S3&lt;/h2&gt; 
&lt;p&gt;With Precisely Connect, you can replicate data directly from mainframes to Amazon S3 in real time, eliminating the need for intermediaries and simplifying modernization. Data flows directly from mainframe sources, including Db2 z/OS, IMS, and VSAM, to Amazon S3, eliminating intermediary steps and reducing both latency and operational complexity. You can move mainframe data directly to Amazon S3 data lakes and analytics platforms without managing complex, multi-step replication processes.&lt;/p&gt; 
&lt;p&gt;The simplicity of this approach reduces maintenance overhead and integration complexity by removing the need for staging servers, middleware, or batch processing systems. After data lands in Amazon S3, it becomes immediately available for downstream AWS workloads. You can use &lt;a href="https://aws.amazon.com/athena/" target="_blank" rel="noopener noreferrer"&gt;Amazon Athena&lt;/a&gt; for SQL queries, &lt;a href="https://aws.amazon.com/glue/" target="_blank" rel="noopener noreferrer"&gt;AWS Glue&lt;/a&gt; for ETL and data cataloging, &lt;a href="https://aws.amazon.com/emr/" target="_blank" rel="noopener noreferrer"&gt;Amazon EMR&lt;/a&gt; for big data processing, &lt;a href="https://aws.amazon.com/sagemaker/ai/" target="_blank" rel="noopener noreferrer"&gt;Amazon SageMaker AI&lt;/a&gt; for machine learning, and &lt;a href="https://aws.amazon.com/quick/quicksight/" target="_blank" rel="noopener noreferrer"&gt;Amazon Quick Sight&lt;/a&gt; for business intelligence dashboards.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;Here we present a solution architecture for streaming mainframe data changes from Db2 z/OS through the &lt;a href="https://aws.amazon.com/marketplace/pp/prodview-sjnpyprulmohs" target="_blank" rel="noopener noreferrer"&gt;AWS Mainframe Modernization – Data Replication for IBM z/OS&lt;/a&gt; AMI directly to Amazon S3 and then using &lt;a href="https://aws.amazon.com/s3/features/tables/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3 Tables&lt;/a&gt; for advanced analytics capabilities.&lt;/p&gt; 
&lt;p&gt;By introducing direct S3 replication and streamlining deployment through the pre-configured AWS Marketplace AMI, you can deploy in minutes rather than weeks. This creates new possibilities for data distribution, transformation, and consumption. This architecture offers several key benefits:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Simplified deployment&lt;/strong&gt; – Accelerate implementation using the preconfigured AWS Marketplace AMI&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Direct replication&lt;/strong&gt; – Eliminate intermediary systems by streaming data directly to Amazon S3, reducing latency and operational overhead&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Real-time synchronization&lt;/strong&gt; – Capture changes as they occur on the mainframe, ensuring downstream applications operate on current data&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Flexible analytics options&lt;/strong&gt; – Use S3 Tables for Iceberg-compatible tabular data storage&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Comprehensive AWS integration&lt;/strong&gt; – Gain immediate access to Amazon EMR, Amazon Athena, AWS Glue, Amazon SageMaker AI, and Amazon Quick Sight&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Natural language data access&lt;/strong&gt; – Through the &lt;a href="https://github.com/awslabs/mcp/tree/main/src/s3-tables-mcp-server" target="_blank" rel="noopener noreferrer"&gt;MCP Server for Amazon S3 Tables&lt;/a&gt;, AI assistants can interact with structured data using conversational interfaces without needing to write SQL queries.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;To complete the solution, you need the following prerequisites:&lt;/p&gt; 
&lt;h3&gt;Precisely components&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/marketplace/pp/prodview-sjnpyprulmohs" target="_blank" rel="noopener noreferrer"&gt;AWS Mainframe Modernization – Data Replication for IBM z/OS&lt;/a&gt; – Deploy this Precisely Connect AMI from AWS Marketplace. This pre-configured image contains the Apply Engine and Controller Daemon components required for replicating mainframe data changes to Amazon S3.&lt;/li&gt; 
 &lt;li&gt;Precisely Connect CDC Capture/Publisher – Deploy the Precisely Connect CDC Capture/Publisher on your mainframe environment. This component captures changes from Db2 z/OS logs and streams them to the Apply Engine over TCP/IP.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;For detailed setup and configuration steps for Precisely components, refer to our previous post &lt;a href="https://aws.amazon.com/blogs/big-data/stream-mainframe-data-to-aws-in-near-real-time-with-precisely-and-amazon-msk/" target="_blank" rel="noopener noreferrer"&gt;Stream mainframe data to AWS in near-real time with Precisely and Amazon MSK&lt;/a&gt;.&lt;/p&gt; 
&lt;h3&gt;Connectivity requirements&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Have network connectivity established between your mainframe environment and AWS using your organization’s approved connectivity method (such as &lt;a href="https://aws.amazon.com/directconnect/" target="_blank" rel="noopener noreferrer"&gt;AWS Direct Connect&lt;/a&gt; or VPN).&lt;/li&gt; 
 &lt;li&gt;Verify that firewall rules allow TCP/IP communication between the mainframe Capture/Publisher and the Apply Engine.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h3&gt;AWS analytics components (optional extension)&lt;/h3&gt; 
&lt;p&gt;After mainframe data lands in Amazon S3, your organization can extend its analytics capabilities using AWS services. One approach is to use Amazon EMR streaming jobs to process and write data to &lt;a href="https://aws.amazon.com/s3/features/tables/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3 Tables&lt;/a&gt;. After the data is stored in S3 Tables, the data can be queried directly using &lt;a href="https://aws.amazon.com/athena/" target="_blank" rel="noopener noreferrer"&gt;Amazon Athena&lt;/a&gt; for ad-hoc SQL analysis. This extension is optional and represents one of several ways to consume and analyze mainframe data after it reaches Amazon S3.&lt;/p&gt; 
&lt;p&gt;The following diagram illustrates the solution architecture.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/28/image-BDB-5540-1.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90677" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/28/image-BDB-5540-1.png" alt="image-BDB-5540-1-architecture" width="1598" height="625"&gt;&lt;/a&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Capture/Publisher&lt;/strong&gt; – Connect CDC Capture/Publisher captures Db2 changes from Db2 logs using IFI 306 Read and communicates captured data changes to a target engine through TCP/IP.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Controller Daemon&lt;/strong&gt; – The Controller Daemon authenticates all connection requests, managing secure communication between the source and target environments.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Apply Engine&lt;/strong&gt; – The Apply Engine receives the changes from the Publisher agent and applies the changed data to the target Amazon S3.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon S3&lt;/strong&gt; – Serves as the scalable data lake foundation where replicated mainframe data lands.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon EMR streaming job&lt;/strong&gt; – As data arrives, an instance of the Amazon EMR streaming job writes the data to target tables in Amazon S3 Tables.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Amazon Athena&lt;/strong&gt; – Queries data stored in Amazon S3 Tables using standard SQL.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;This architecture provides a clean separation between the data capture process and the data consumption process, allowing each to scale independently. When CDC data arrives in Amazon S3, you can use Amazon S3 Tables to store Db2 z/OS, VSAM, and IMS data in an open table format (Apache Iceberg) that is ready for analytics, providing a flexible path to mainframe modernization.&lt;/p&gt; 
&lt;h2&gt;Quantifiable business value&lt;/h2&gt; 
&lt;p&gt;Organizations implementing this solution typically see significant reductions in mainframe operational costs by offloading analytics and reporting workloads to the cloud. The elimination of intermediary infrastructure reduces both capital and operational expenses. The reduced maintenance burden frees IT resources to focus on strategic initiatives rather than managing complex replication systems.&lt;/p&gt; 
&lt;p&gt;Speed and agility improvements are equally significant. Near real-time data availability, measured in seconds to minutes rather than hours to days, enables organizations to respond rapidly to market changes and operational events. The rapid deployment of new analytics use cases without requiring mainframe changes accelerates innovation. Organizations gain access to the full breadth of AWS services that can be used immediately after data lands in Amazon S3.&lt;/p&gt; 
&lt;p&gt;From an analytics and AI perspective, the solution creates a unified data platform that brings together mainframe, cloud-native, and third-party data sources. This unified view enables advanced machine learning on historical and current data, delivering predictive insights that drive proactive decision-making across the organization.&lt;/p&gt; 
&lt;h2&gt;Customer story&lt;/h2&gt; 
&lt;p&gt;A leading global payments provider put this approach into practice. The company was struggling to generate timely analytics and insights from point of sale (POS) transaction data. As one of the world’s largest payment providers, it processes hundreds of thousands of transactions per second, and users expect to swipe their card and have the transaction approved in seconds. A new architecture was needed to keep up with customer demand and volume. By streaming mission-critical mainframe data directly to AWS in real time using Precisely Connect and landing it in Amazon S3 Tables, the company used storage built on the Apache Iceberg open standard. This approach enables high-performance analytics directly on mainframe data alongside cloud-native sources.&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, we demonstrated how Precisely Connect enables real-time, direct data replication from mainframes to Amazon S3, eliminating intermediaries and simplifying mainframe modernization.&lt;/p&gt; 
&lt;p&gt;Your organization can further extend this foundation with &lt;a href="https://aws.amazon.com/s3/features/tables/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3 Tables&lt;/a&gt;, purpose-built storage for Apache Iceberg tables in S3, enabling analytical applications to query the most current mainframe data using tools such as Amazon Athena, Amazon EMR, and Amazon Redshift.&lt;/p&gt; 
&lt;p&gt;Get started by deploying &lt;a href="https://aws.amazon.com/marketplace/pp/prodview-sjnpyprulmohs" target="_blank" rel="noopener noreferrer"&gt;AWS Mainframe Modernization – Data Replication for IBM z/OS&lt;/a&gt; from AWS Marketplace and use Amazon S3 as a target for your mainframe use cases. Learn more about Precisely’s mainframe data integration capabilities at &lt;a href="http://www.precisely.com" target="_blank" rel="noopener noreferrer"&gt;precisely.com&lt;/a&gt;. Contact &lt;a href="https://aws.amazon.com/contact-us/" target="_blank" rel="noopener noreferrer"&gt;AWS&lt;/a&gt; and Precisely experts to discuss your specific modernization challenges and design a proof-of-concept that demonstrates business value quickly.&lt;/p&gt; 
&lt;hr&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/28/image-BDB-5540-2.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90680" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/28/image-BDB-5540-2.png" alt="image-BDB-5540-2" width="100" height="107"&gt;&lt;/a&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Supreet Padhi&lt;/h3&gt; 
  &lt;p&gt;Supreet is a Technology Architect at Precisely. He has been with Precisely for more than 14 years, specializing in streaming data use cases and technology, with an emphasis on data warehouse architecture. He is responsible for research and development in areas such as Change Data Capture (CDC), streaming ETL, metadata management, and VectorDBs.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/28/image-BDB-5540-3-1.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90680" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/28/image-BDB-5540-3-1.png" alt="image-BDB-5540-3" width="100" height="107"&gt;&lt;/a&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Rochelle Grubbs&lt;/h3&gt; 
  &lt;p&gt;Rochelle is a Senior Director and Solution Architect for Precisely’s Data Integration solutions and has been with Precisely for over 11 years. She has spent the last several years focusing on databases, analytics, data trends, data integration, and GenAI. Rochelle is an expert on Precisely’s OEM AWS Mainframe Migration offering and is driven to help customers successfully migrate their applications and workloads to the cloud.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;a href="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/28/image-BDB-5540-4.png"&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90680" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/28/image-BDB-5540-4.png" alt="image-BDB-5540-4" width="100" height="107"&gt;&lt;/a&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Tamara Astakhova&lt;/h3&gt; 
  &lt;p&gt;Tamara is a Sr. Partner Solutions Architect in Data and Analytics at AWS with over two decades of expertise in architecting and developing large-scale data analytics systems. In her current role, she collaborates with strategic partners to design and implement sophisticated AWS-optimized architectures. Her deep technical knowledge and experience make her an invaluable resource in helping organizations transform their data infrastructure and analytics capabilities.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Build streaming applications on Amazon Managed Service for Apache Flink with AI-assisted guidance</title>
		<link>https://aws.amazon.com/blogs/big-data/build-streaming-applications-on-amazon-managed-service-for-apache-flink-with-ai-assisted-guidance/</link>
					
		
		<dc:creator><![CDATA[Mazrim Mehrtens]]></dc:creator>
		<pubDate>Wed, 06 May 2026 15:45:57 +0000</pubDate>
				<category><![CDATA[Amazon Managed Service for Apache Flink]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Announcements]]></category>
		<category><![CDATA[Technical How-to]]></category>
		<guid isPermaLink="false">e939b9e157bb989186598da2f70b5d75d0ac8981</guid>

					<description>In this post, we walk through installing the Power and Skill, using Amazon Kinesis Data Streams to build a Kinesis Data Stream-to-Kinesis Data Stream streaming pipeline, and migrating an existing application to Flink 2.2. You can follow along with this use case to see how the Managed Service for Apache Flink Kiro Power can help you build a resilient, performant application grounded in best practices.</description>
										<content:encoded>&lt;p&gt;Building production-ready &lt;a href="https://flink.apache.org/" target="_blank" rel="noopener noreferrer"&gt;Apache Flink&lt;/a&gt; applications requires learning a complex ecosystem. The learning curve is steep for newcomers, and even experienced Flink developers encounter complexity when scaling applications or troubleshooting production issues. With the new &lt;a href="https://kiro.dev" target="_blank" rel="noopener noreferrer"&gt;Kiro&lt;/a&gt; &lt;a href="https://kiro.dev/powers/" target="_blank" rel="noopener noreferrer"&gt;Power&lt;/a&gt; and &lt;a href="https://agentskills.io/home" target="_blank" rel="noopener noreferrer"&gt;Agent Skill&lt;/a&gt; for &lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/what-is.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Service for Apache Flink&lt;/a&gt;, you can get AI-assisted guidance for building, improving, and migrating streaming applications directly in your development environment, with recommendations that are grounded in best practices.&lt;/p&gt; 
&lt;p&gt;The Managed Service for Apache Flink Kiro Power and Agent Skill help you navigate challenges across the Flink application lifecycle. For new development, the tool provides contextual guidance on application architecture, state management patterns, and connector selection. For existing applications, it analyzes your code to identify performance bottlenecks, reliability risks, and opportunities for improvement. If you’re &lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/flink-2-2-upgrade-guide.html" target="_blank" rel="noopener noreferrer"&gt;upgrading from Apache Flink 1.x to 2.x&lt;/a&gt;, it detects compatibility issues and provides targeted refactoring steps to modernize your applications.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="wp-image-90694 size-full aligncenter" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-5913-1-resize.png" alt="" width="600" height="1084"&gt;&lt;/p&gt; 
&lt;p&gt;In this post, we walk through installing the Power and Skill, building a streaming pipeline that reads from one &lt;a href="https://aws.amazon.com/kinesis/" target="_blank" rel="noopener noreferrer"&gt;Amazon Kinesis Data Streams&lt;/a&gt; stream and writes to another, and migrating an existing application to Flink 2.2. You can follow along with this use case to see how the Managed Service for Apache Flink Kiro Power can help you build a resilient, performant application grounded in best practices.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;The &lt;a href="https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files" target="_blank" rel="noopener noreferrer"&gt;Managed Service for Apache Flink Power/Skill&lt;/a&gt; works across multiple AI development tools, providing the same comprehensive guidance in each:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Kiro&lt;/strong&gt;: Installs as a Power that automatically activates for Flink-related development activities&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://cursor.com/en-US/docs" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;Cursor&lt;/strong&gt;&lt;/a&gt;&lt;strong&gt; and &lt;/strong&gt;&lt;a href="https://code.claude.com/docs/en/overview" target="_blank" rel="noopener noreferrer"&gt;&lt;strong&gt;Claude Code&lt;/strong&gt;&lt;/a&gt;: Installs as an Agent Skill following the open Agent Skills standard&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Other compatible agents&lt;/strong&gt;: Compatible with tools supporting the Agent Skills specification&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The Power/Skill provides guidance across the development lifecycle:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Best practices for Managed Service for Apache Flink application development&lt;/li&gt; 
 &lt;li&gt;Maven dependency management and project structure&lt;/li&gt; 
 &lt;li&gt;Resource improvements including KPU sizing, parallelism tuning, and checkpointing&lt;/li&gt; 
 &lt;li&gt;Job graph architecture patterns and anti-patterns&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://aws.amazon.com/cloudwatch/" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch&lt;/a&gt; monitoring and logging configuration&lt;/li&gt; 
 &lt;li&gt;Flink 1.x to 2.2 migration guidance with state compatibility assessment&lt;/li&gt; 
 &lt;li&gt;Connector-specific guidelines&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The content is maintained in a single repository with use case-specific entry points that are loaded dynamically depending on your needs.&lt;/p&gt; 
&lt;h2&gt;Prerequisites&lt;/h2&gt; 
&lt;p&gt;To use the tool, you need:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;A development machine running macOS, Linux, or Windows with Java 11 or later (Java 17 for Flink 2.2) and Apache Maven installed&lt;/li&gt; 
 &lt;li&gt;One of the following AI development tools: 
  &lt;ul&gt; 
   &lt;li&gt;Kiro IDE&lt;/li&gt; 
   &lt;li&gt;Cursor&lt;/li&gt; 
   &lt;li&gt;Claude Code&lt;/li&gt; 
   &lt;li&gt;Other Agent Skills-compatible tools&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;Basic knowledge of Java and stream processing concepts (helpful but not required)&lt;/li&gt; 
 &lt;li&gt;An &lt;a href="https://aws.amazon.com/iam/" target="_blank" rel="noopener noreferrer"&gt;AWS Identity and Access Management&lt;/a&gt; (IAM) role configured with access to create and run Managed Service for Apache Flink applications, create Amazon Simple Storage Service (Amazon S3) buckets for Flink application dependencies, create Kinesis Data Streams for streaming, and create IAM roles (required if deploying an application)&lt;/li&gt; 
&lt;/ul&gt; 
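&lt;p&gt;The Java requirement above (Java 11 or later, Java 17 for Flink 2.2) can be checked with a small script. The following is a minimal sketch, not part of the Power/Skill: the helper names java_major and meets_minimum are hypothetical, and the version strings are sample inputs of the kind reported by java -version.&lt;/p&gt; 

```shell
# Hypothetical helper: extract the major version from a Java version string.
# Handles both the legacy "1.8.0_392" scheme and the modern "17.0.9" scheme.
java_major() {
  version="$1"
  case "$version" in
    1.*) version="${version#1.}" ;;  # legacy scheme: "1.8.0_392" becomes "8.0_392"
  esac
  echo "${version%%.*}"
}

# Hypothetical helper: succeeds when the version satisfies the minimum major version.
meets_minimum() {
  [ "$(java_major "$1")" -ge "$2" ]
}

# Sample inputs: the post states Java 17 as the minimum for Flink 2.2.
if meets_minimum "17.0.9" 17; then
  echo "Java 17.0.9 is sufficient for Flink 2.2"
fi
if ! meets_minimum "11.0.21" 17; then
  echo "Java 11.0.21 is not sufficient for Flink 2.2"
fi
```

&lt;p&gt;To apply the check to your own machine, pass the version string that java -version reports as the first argument.&lt;/p&gt; 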
&lt;h2&gt;Installation&lt;/h2&gt; 
&lt;h3&gt;Installing as a Kiro Power&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Open Kiro IDE.&lt;/li&gt; 
 &lt;li&gt;Open &lt;a href="https://kiro.dev/launch/powers/amazon-managed-service-for-apache-flink-power/"&gt;Amazon Managed Service for Apache Flink&lt;/a&gt; and select&amp;nbsp;&lt;strong&gt;Open in Kiro&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90620" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/BDB-5913-open-in-kiro.png" alt="" width="2896" height="1560"&gt;&lt;/p&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;Choose&amp;nbsp;&lt;strong&gt;Install&lt;/strong&gt; to install the power.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90621" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/BDB-5913-install-power.png" alt="" width="2238" height="1064"&gt;&lt;/p&gt; 
&lt;ol start="4"&gt; 
 &lt;li&gt;Verify that the power is listed in the installed powers in the Kiro IDE.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90634" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/BDB-5913-installed-power-1.png" alt="" width="2736" height="1158"&gt;&lt;/p&gt; 
&lt;p&gt;The Power is now installed and automatically activates when you work on Flink-related development activities.&lt;/p&gt; 
&lt;h3&gt;Installing as an Agent Skill&lt;/h3&gt; 
&lt;p&gt;Agent Skills are discovered automatically by compatible tools through the SKILL.md file. Installation varies by tool:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Per-project installation&lt;/strong&gt; (available in one project):&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# For Cursor
git clone https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files.git .cursor/skills/flink

# For Claude Code
git clone https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files.git .claude/skills/flink

# For other Agent Skills-compatible tools
git clone https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files.git .agents/skills/flink&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Personal installation&lt;/strong&gt; (available across projects):&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# For Cursor
git clone https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files.git ~/.cursor/skills/flink

# For Claude Code
git clone https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files.git ~/.claude/skills/flink&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;To verify the installation, interact with the skill in your preferred tool. In Claude Code, you can invoke it with /flink. In Cursor, type / in Agent chat and search for flink. For more information about Agent Skills, see the &lt;a href="https://agentskills.io/home" target="_blank" rel="noopener noreferrer"&gt;Agent Skills documentation&lt;/a&gt;.&lt;/p&gt; 
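&lt;p&gt;Because compatible tools discover the skill through its SKILL.md file, a quick file check can confirm that a clone landed in the expected place. The following is a minimal sketch: the function name find_flink_skill is hypothetical, and it assumes SKILL.md sits at the root of the cloned repository.&lt;/p&gt; 

```shell
# Hypothetical helper: list which install locations from this post contain SKILL.md.
# Assumption: SKILL.md sits at the root of the cloned repository.
find_flink_skill() {
  base="$1"  # project root, or "$HOME" for a personal installation
  for tool in .cursor .claude .agents; do
    candidate="$base/$tool/skills/flink/SKILL.md"
    if [ -f "$candidate" ]; then
      echo "$candidate"
    fi
  done
}

# Check the current project first, then the personal installation directories.
find_flink_skill .
find_flink_skill "$HOME"
```

&lt;p&gt;No output means that no SKILL.md file was found in any of the checked locations.&lt;/p&gt; 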
&lt;h2&gt;Example: Building a Kinesis-to-Kinesis streaming pipeline&lt;/h2&gt; 
&lt;p&gt;Rather than listing best practices, the Power/Skill actively guides you through making the right architectural decisions at each stage of development.&lt;/p&gt; 
&lt;p&gt;The following walkthrough demonstrates building a Flink application that reads from &lt;a href="https://aws.amazon.com/kinesis/" target="_blank" rel="noopener noreferrer"&gt;Amazon Kinesis Data Streams&lt;/a&gt;, analyzes events, and writes to another Kinesis stream. To follow along, run the same prompts in your Kiro IDE or other development tool. In the following prompts, we focus on local development and don’t create AWS resources. However, if you prompt the agent to create and deploy AWS resources, they will incur additional costs.&lt;/p&gt; 
&lt;h3&gt;Starting the conversation&lt;/h3&gt; 
&lt;p&gt;In the Kiro IDE, we can open a new chat in Vibe mode and prompt: &lt;em&gt;“Help me build a Flink application that reads from Kinesis, processes events with windowed aggregations, and writes results to another Kinesis stream”:&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-90465 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5913-5.png" alt="Kiro chat showing a prompt to build a Kinesis streaming application" width="2066" height="1626"&gt;&lt;/p&gt; 
&lt;h3&gt;What happens next&lt;/h3&gt; 
&lt;p&gt;The AI assistant loads relevant guidance and walks you through the development process:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;1. Confirm project requirements and details&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Kiro automatically loads the Power based on the context of your prompt. The assistant then asks you questions about your use case to make sure that it builds the right application for your needs:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90637" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/BDB-5913-6-1.png" alt="" width="1304" height="2212"&gt;&lt;/p&gt; 
&lt;p&gt;For the demo, we can prompt for a financial services use case: &lt;em&gt;“I’m in financial services, so let’s use that as the use case. Try calculating volatility in real-time. And let’s use Flink 1.20 for now.”&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;Kiro then confirms its assumptions and asks to proceed:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90467" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5913-7.png" alt="" width="1982" height="934"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;2. Project setup&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;After we confirm, Kiro generates a project with Flink 1.20 dependencies, Kinesis connectors, and proper scope configuration for Managed Service for Apache Flink deployment. The assistant creates the application structure with proper configuration separation between local development and Managed Service for Apache Flink service-level settings. It then creates a Kinesis source with proper deserialization, a sink with a partitioning strategy, and windowed aggregation logic with proper &lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/troubleshooting-checkpoints.html" target="_blank" rel="noopener noreferrer"&gt;state management&lt;/a&gt;, &lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/troubleshooting-rt-stateleaks.html" target="_blank" rel="noopener noreferrer"&gt;TTL configuration&lt;/a&gt;, and &lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/troubleshooting-source-throttling.html" target="_blank" rel="noopener noreferrer"&gt;error handling&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-90468 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5913-8.png" alt="Generated project structure with Flink dependencies and Kinesis connectors" width="1974" height="1788"&gt;&lt;/p&gt; 
&lt;p&gt;Kiro also compiles the code to verify that it builds correctly. We can then proceed by asking Kiro to help us with running the application locally for testing.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;3. Testing the project locally&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;You can run the application locally to test the results. We can prompt: &lt;em&gt;“Can we run this locally using something like LocalStack to test deploying the job and also see some example results?”&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;Kiro creates the necessary Docker resources, testing scripts, and deployment steps to run the application locally with synthetic resources. If it encounters bugs or detects issues during the local testing process, it fixes them so that your deployment runs smoothly:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-90469 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5913-9.png" alt="Kiro creating Docker resources and local testing infrastructure" width="1464" height="1656"&gt;&lt;/p&gt; 
&lt;p&gt;We can also access our local Flink UI to view our application:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-90470 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5913-10.png" alt="Local Flink UI showing the running streaming application" width="3204" height="1898"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;4. Deploying the application to Managed Service for Apache Flink&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Now that our application is running and generating results end-to-end, we can use the Power for other tasks. For example, you can get guidance on KPU allocation and parallelism settings based on your expected throughput, configure monitoring with CloudWatch metrics, logging, and dashboards for operational visibility, or set up infrastructure as code (IaC) for deploying in Managed Service for Apache Flink. We can prompt: &lt;em&gt;“This is great! Can you help me deploy this application to Managed Service for Apache Flink? I’d like to use CloudFormation for deployment.”&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-90471 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5913-11.png" alt="Kiro conversation summarizing creation of CloudFormation deployment resources" width="1770" height="1716"&gt;&lt;/p&gt; 
&lt;p&gt;Using the generated &lt;a href="https://aws.amazon.com/cloudformation/" target="_blank" rel="noopener noreferrer"&gt;AWS CloudFormation&lt;/a&gt; templates and deployment scripts, we can deploy our application to AWS with associated resources for Kinesis Data Streams, &lt;a href="https://aws.amazon.com/s3/" target="_blank" rel="noopener noreferrer"&gt;Amazon S3&lt;/a&gt; buckets for application JAR files, CloudWatch log groups, and IAM roles. Deploying these resources requires IAM credentials with associated permissions and will incur cost for the associated resource usage.&lt;/p&gt; 
&lt;p&gt;In a traditional workflow, you build your application, deploy to Managed Service for Apache Flink, then discover performance issues or configuration problems in production. You spend time debugging checkpoint failures, serialization errors, or resource bottlenecks. With the Power/Skill, the AI assistant catches these issues during development. When you need complex aggregation and processing logic, it helps you implement that logic in a way that uses resources efficiently with Flink’s scaling model. When you introduce a bug that would cause a crash in production, it helps you identify it early with local end-to-end testing. The Power is configured with guidance and best practices to help with the development process from start to finish.&lt;/p&gt; 
&lt;h2&gt;Example: Migrating to Flink 2.2&lt;/h2&gt; 
&lt;p&gt;The Managed Service for Apache Flink Kiro Power and Agent Skill provide contextual advice specific to your situation. For new developers, it walks through the complete workflow from project setup to deployment, explaining Managed Service for Apache Flink-specific concepts along the way. For migration projects, it analyzes your existing code for Flink 2.2 compatibility issues and provides targeted refactoring guidance. The following example shows how the tool helps with the complex task of migrating from Flink 1.x to 2.2.&lt;/p&gt; 
&lt;h3&gt;1. Assessing migration compatibility&lt;/h3&gt; 
&lt;p&gt;We can ask Kiro to help us upgrade our project from the previous example to Flink 2.2: &lt;em&gt;“I need to migrate my Flink 1.x application to 2.2. Can you help me identify compatibility issues?”&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;The assistant loads the Managed Service for Apache Flink Kiro Power and analyzes our code to identify potential issues:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-90472 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5913-12.png" alt="Kiro analyzing Flink 1.x code for 2.2 compatibility issues" width="2322" height="1738"&gt;&lt;/p&gt; 
&lt;p&gt;In this case, using our generated project on Flink 1.20, Kiro identified the following compatibility issues for the upgrade:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Java 11 must move to Java 17 (minimum for Flink 2.2)&lt;/li&gt; 
 &lt;li&gt;Flink version 1.20.3 must update to 2.2.0&lt;/li&gt; 
 &lt;li&gt;The Kinesis connector must update from 5.1.0-1.20 to 6.0.0-2.0&lt;/li&gt; 
 &lt;li&gt;Time references must change to java.time.Duration in window and lateness calls&lt;/li&gt; 
 &lt;li&gt;The LocalStreamEnvironment instanceof check must be removed (class removed in 2.2)&lt;/li&gt; 
 &lt;li&gt;The isEndOfStream() override must be dropped from PriceTickDeserializer (method removed)&lt;/li&gt; 
 &lt;li&gt;implements Serializable must be added to PriceTick and VolatilityResult&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;It also verified that some parts of the project are already Flink 2.2 compatible. The project uses the new Source and Sink V2 APIs, the logging is 2.2 ready, the POJOs with no collection fields are state migration safe, and there are no Kryo registrations or TimeCharacteristic usage.&lt;/p&gt; 
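&lt;p&gt;Some of the findings above name specific identifiers, so a plain text search can give a rough first pass before you ask the assistant for a full analysis. The following is a minimal sketch and not how the Power performs its analysis: the function name scan_flink1_apis is hypothetical, the pattern list is illustrative rather than exhaustive, and the windowing.time.Time pattern assumes that the Time references come from the Flink 1.x windowing package. The --include option is a GNU and BSD grep extension.&lt;/p&gt; 

```shell
# Hypothetical helper: search Java sources for identifiers named in the findings above.
# The pattern list is illustrative, not a complete Flink 2.2 compatibility checklist.
scan_flink1_apis() {
  src_dir="$1"
  grep -rnE --include='*.java' \
    'LocalStreamEnvironment|isEndOfStream|TimeCharacteristic|windowing\.time\.Time' \
    "$src_dir"
}

# Usage (the path is a placeholder for your project's source directory):
# scan_flink1_apis src/main/java
```

&lt;p&gt;Each match is printed with its file name and line number. The function returns a nonzero status when nothing matches, which is the standard grep behavior.&lt;/p&gt; 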
&lt;h3&gt;2. Implementing the migration&lt;/h3&gt; 
&lt;p&gt;We can then ask Kiro to provide a step-by-step migration plan, both for updating the code and deploying to Managed Service for Apache Flink: &lt;em&gt;“Can you help me update the application for Flink 2.2, and help me figure out the steps to upgrade my running Managed Service for Apache Flink application?”&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;Kiro evaluates the entire application code base against the Power’s migration guidance and best practices, and provides a comprehensive analysis of the breaking changes, risks, and potential issues that would arise in the upgrade. After we approve the changes, Kiro makes the updates needed for Flink 2.2 compatibility and provides a step-by-step upgrade process for the running application:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-90473 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/BDB-5913-13.png" alt="Kiro providing a step-by-step migration plan for Flink 2.2" width="2488" height="1592"&gt;&lt;/p&gt; 
&lt;p&gt;Now that Kiro has prepared the application for Flink 2.2, highlighted migration risks, and provided us with a clear path to execute the upgrade, we can test the upgrade process with confidence. From here, we can run our Flink 2.2 application locally, test the upgrade process in a development environment in Managed Service for Apache Flink, and then execute the upgrade in our production environment. If we run into issues, we can return to the Kiro Power to get advice, resolve issues, and unblock our upgrade.&lt;/p&gt; 
&lt;h2&gt;Cleanup&lt;/h2&gt; 
&lt;p&gt;To remove the Power/Skill installation:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;For Kiro:&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Open Kiro IDE.&lt;/li&gt; 
 &lt;li&gt;Navigate to the &lt;strong&gt;Powers&lt;/strong&gt; tab.&lt;/li&gt; 
 &lt;li&gt;Uninstall the &lt;strong&gt;Amazon Managed Service for Apache Flink&lt;/strong&gt; Power.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;strong&gt;For Agent Skills:&lt;/strong&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;# Remove per-project installation
rm -rf .cursor/skills/flink  # or .claude/skills/flink

# Remove personal installation
rm -rf ~/.cursor/skills/flink  # or ~/.claude/skills/flink&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;If you created Managed Service for Apache Flink applications or associated resources during development, clean the resources up:&lt;/p&gt; 
&lt;ol start="4"&gt; 
 &lt;li&gt;Delete the Managed Service for Apache Flink application from the AWS Console.&lt;/li&gt; 
 &lt;li&gt;Remove associated resources for sources and sinks, if created for development.&lt;/li&gt; 
 &lt;li&gt;Delete CloudWatch log groups if no longer needed.&lt;/li&gt; 
&lt;/ol&gt; 
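&lt;p&gt;If you prefer the AWS CLI to the console for these resource cleanup steps, the following minimal sketch prints the relevant commands for review instead of running them. The function name print_cleanup_commands is hypothetical, and all resource names are placeholders. It assumes that your source and sink are Kinesis data streams. The delete-application command requires the application’s creation timestamp, which the describe-application command returns; CREATE_TIMESTAMP is a placeholder for that value.&lt;/p&gt; 

```shell
# Hypothetical helper: print (not run) AWS CLI commands for the cleanup steps above.
# All resource names are placeholders supplied by the caller.
print_cleanup_commands() {
  app="$1"; stream="$2"; log_group="$3"
  # Look up the creation timestamp that delete-application requires.
  echo "aws kinesisanalyticsv2 describe-application --application-name $app --query ApplicationDetail.CreateTimestamp --output text"
  echo "aws kinesisanalyticsv2 delete-application --application-name $app --create-timestamp CREATE_TIMESTAMP"
  # Remove a development stream and the application log group.
  echo "aws kinesis delete-stream --stream-name $stream"
  echo "aws logs delete-log-group --log-group-name $log_group"
}

# Example with placeholder names:
print_cleanup_commands my-flink-app my-dev-stream /aws/kinesis-analytics/my-flink-app
```

&lt;p&gt;Review the printed commands, replace the placeholders with your own values, and run only the commands that apply to resources you created for development.&lt;/p&gt; 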
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;In this post, we showed you how the Kiro Power and Agent Skill for Amazon Managed Service for Apache Flink bring AI-assisted development to stream processing. You can use the tool to overcome Flink’s learning curve, build applications following Managed Service for Apache Flink best practices, and migrate to Flink 2.2 with confidence. To get started, choose the path that fits your workflow:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;If you use Kiro, install the Power from the Powers tab and start a new chat with a Flink-related prompt.&lt;/li&gt; 
 &lt;li&gt;If you use Cursor, Claude Code, or another Agent Skills-compatible tool, clone the &lt;a href="https://github.com/aws-samples/kiro-powers-apache-flink" target="_blank" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; into your skills directory and reference the steering/ files for guidance.&lt;/li&gt; 
 &lt;li&gt;If you are new to Amazon Managed Service for Apache Flink, review the &lt;a href="https://docs.aws.amazon.com/managed-flink/latest/java/what-is.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Service for Apache Flink Developer Guide&lt;/a&gt; and the &lt;a href="https://nightlies.apache.org/flink/flink-docs-stable/" target="_blank" rel="noopener noreferrer"&gt;Apache Flink documentation&lt;/a&gt; to build foundational knowledge alongside the Power/Skill.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;We welcome your feedback. Report issues or request features through &lt;a href="https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files/issues" target="_blank" rel="noopener noreferrer"&gt;GitHub Issues&lt;/a&gt;, or contribute improvements via pull requests.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-89475" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/03/24/bdb-5775-mmehrten-headshot.png" alt="" width="100" height="107"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Mazrim Mehrtens&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/mmehrtens/" target="_blank" rel="noopener"&gt;Mazrim&lt;/a&gt; is a Sr. Specialist Solutions Architect for messaging and streaming workloads. Mazrim works with customers to build and support systems that process and analyze terabytes of streaming data in real time, run enterprise Machine Learning pipelines, and create systems to share data across teams seamlessly with varying data toolsets and software stacks.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Migrating TLS Clients managed by third-party Certificate Authorities from self-managed Apache Kafka to Amazon MSK</title>
		<link>https://aws.amazon.com/blogs/big-data/migrating-tls-clients-managed-by-third-party-certificate-authorities-from-self-managed-apache-kafka-to-amazon-msk/</link>
					
		
		<dc:creator><![CDATA[Ali Alemi]]></dc:creator>
		<pubDate>Wed, 06 May 2026 15:41:21 +0000</pubDate>
				<category><![CDATA[Amazon Managed Streaming for Apache Kafka (Amazon MSK)]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Intermediate (200)]]></category>
		<guid isPermaLink="false">125f60f817573749f223950d8146ad840813a80a</guid>

					<description>In this post, we provide an approach to reuse your existing client certificates without reissuing them through AWS Certificate Manager (ACM) Private Certificate Authority. This solution enables an accelerated migration path by using your current third-party CA infrastructure. This removes the complexity and operational overhead of certificate re-issuance while maintaining the security posture that you've established with your existing mTLS implementation.</description>
										<content:encoded>&lt;p&gt;&lt;a href="https://aws.amazon.com/msk/" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Streaming for Apache Kafka (Amazon MSK)&lt;/a&gt; is a fully managed streaming data service that handles &lt;a href="https://kafka.apache.org/" target="_blank" rel="noopener noreferrer"&gt;Apache Kafka&lt;/a&gt; infrastructure and operations, so developers and DevOps managers can run Apache Kafka applications on AWS. Migrating to Amazon MSK requires no application code changes because Amazon MSK uses fully open source Apache Kafka, allowing existing applications and tools to work seamlessly. &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/msk-broker-types-express.html" target="_blank" rel="noopener noreferrer"&gt;Amazon MSK with Express brokers&lt;/a&gt; streamlines Kafka management by providing up to 3x more throughput, 20x faster scaling, and 180x faster recovery with virtually unlimited storage, delivering resiliency and elasticity for mission-critical workloads.&lt;/p&gt; 
&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/kafka_apis_iam.html" target="_blank" rel="noopener noreferrer"&gt;Amazon MSK supports multiple authentication methods&lt;/a&gt; to secure client connections to Kafka clusters. These methods include:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/iam-access-control.html" target="_blank" rel="noopener noreferrer"&gt;IAM authentication&lt;/a&gt; for identity-based access control using AWS Identity and Access Management (IAM) policies.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/msk-authentication.html" target="_blank" rel="noopener noreferrer"&gt;Mutual TLS (mTLS) authentication&lt;/a&gt; where both clients and brokers authenticate each other using X.509 certificates.&lt;/li&gt; 
 &lt;li&gt;&lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/msk-password.html" target="_blank" rel="noopener noreferrer"&gt;SASL/SCRAM authentication&lt;/a&gt; for username and password-based authentication, with credentials stored in AWS Secrets Manager.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;When customers manage their own Kafka clusters and adopt mTLS, they typically rely on a third-party managed certificate authority (CA) to sign and verify both client and server certificates. This establishes a trust relationship where the CA acts as the trusted intermediary that validates the identity of both parties in the communication. When customers migrate their workloads to Amazon MSK, they must make sure that client certificates are signed by a CA that’s recognized and trusted by the MSK cluster. Amazon MSK recommends that customers use &lt;a href="https://docs.aws.amazon.com/privateca/latest/userguide/PcaWelcome.html" target="_blank" rel="noopener noreferrer"&gt;AWS Private Certificate Authority&lt;/a&gt; to create a private CA within AWS that MSK trusts. The migration path typically requires customers to either:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Generate new client certificates signed by an AWS Private CA that Amazon MSK recognizes, or&lt;/li&gt; 
 &lt;li&gt;Establish a certificate chain where their existing third-party CA is subordinate to or trusted by an AWS-managed CA&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;In this post, we provide an approach to reuse your existing client certificates without reissuing them through AWS Certificate Manager (ACM) Private Certificate Authority. This solution enables an accelerated migration path by using your current third-party CA infrastructure. This removes the complexity and operational overhead of certificate re-issuance while maintaining the security posture that you’ve established with your existing mTLS implementation.&lt;/p&gt; 
&lt;h2&gt;Solution overview&lt;/h2&gt; 
&lt;p&gt;This approach involves four key steps to reuse your existing client certificates when migrating to Amazon MSK:&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;1. Create an Intermediate Certificate Using Your Third-Party CA&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;First, you generate an intermediate certificate authority (CA) certificate using your existing third-party CA infrastructure. This intermediate certificate acts as a bridge between your current certificate management system and AWS.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;2. Import the Intermediate Certificate into AWS Certificate Manager as a Private CA&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Next, you import this intermediate certificate into AWS Certificate Manager (ACM) as a Private Certificate Authority (PCA). This step establishes the intermediate CA within the AWS environment, making it recognizable to AWS services.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;3. Integrate Amazon MSK with the PCA created from your Intermediate Certificate&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;You then configure your Amazon MSK cluster to use the ACM Private CA that contains your imported intermediate certificate. This integration enables Amazon MSK to recognize and trust certificates signed by your certificate authority.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;4. Establish trust through common Certificate Authority&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;This approach works because both the AWS Private CA and your existing client certificates share the same root of trust: both are signed by your third-party CA. When Amazon MSK validates client certificates, it can trace the certificate chain back through the intermediate certificate in AWS Private CA to your trusted third-party CA, establishing a complete chain of trust without requiring certificate reissuance. This solution maintains your existing security architecture while enabling seamless migration to Amazon MSK, so your clients can continue using their current certificates without interruption.&lt;/p&gt; 
&lt;div id="attachment_90636" style="width: 1100px" class="wp-caption aligncenter"&gt;
 &lt;img aria-describedby="caption-attachment-90636" loading="lazy" class="wp-image-90636 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-1-10.png" alt="" width="1090" height="668"&gt;
 &lt;p id="caption-attachment-90636" class="wp-caption-text"&gt;Figure 1: Architecture diagram showing the integration of third-party Certificate Authority with Amazon MSK through AWS Certificate Manager Private CA&lt;/p&gt;
&lt;/div&gt; 
&lt;h2&gt;Implementation steps&lt;/h2&gt; 
&lt;p&gt;In real-world scenarios, you already have a certificate authority that has issued certificates for your clients. For the purpose of this post, we use a &lt;a href="https://github.com/aws-samples/msk-third-party-mtls" target="_blank" rel="noopener noreferrer"&gt;code sample&lt;/a&gt; to create a self-signed certificate authority (using OpenSSL) to demonstrate the implementation steps. If you already have an existing certificate authority, you don’t need to create a root CA. You can generate an intermediate CA (Step 2) using your third-party CA and continue following the steps from where you import the intermediate CA certificate into AWS ACM as a Private Certificate Authority.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Step 1: Create a root Certificate Authority using OpenSSL&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;&lt;strong&gt;Cloning the repository&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;To clone the repository, complete the following steps:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Clone the repository&lt;/strong&gt; using the following command:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;code&gt;git clone https://github.com/aws-samples/msk-third-party-mtls&lt;/code&gt;&lt;/p&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;&lt;strong&gt;Change to the repository’s &lt;code&gt;openssl&lt;/code&gt; directory&lt;/strong&gt;:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;code&gt;cd ./msk-third-party-mtls/openssl&lt;/code&gt;&lt;/p&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;&lt;strong&gt;Make the scripts executable, then run the setup script&lt;/strong&gt;:&lt;/li&gt; 
&lt;/ol&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-shell"&gt;chmod +x *.sh
./setup-ca.sh&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;You will be prompted to set a password for the private key and the certificate. The following is an example of the output:&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90635 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-2-8.png" alt="" width="1020" height="589"&gt;&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Step 2: Create an intermediate CA for AWS ACM&lt;/strong&gt;&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;In the AWS Private CA console, create a subordinate CA.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90633 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-3-7.png" alt="" width="824" height="510"&gt;&lt;/p&gt; 
&lt;ol start="2"&gt; 
 &lt;li&gt;Enter distinguished name information that matches your organization, choose a key algorithm, and choose &lt;strong&gt;Create CA&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;From the &lt;strong&gt;Actions&lt;/strong&gt; menu, select &lt;strong&gt;Install CA certificate&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Download the Certificate Signing Request (CSR) file provided by AWS Private CA.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90632 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-4-7.png" alt="" width="1432" height="573"&gt;&lt;/p&gt; 
&lt;ol start="5"&gt; 
 &lt;li&gt;Save the CSR file to your local &lt;code&gt;certs&lt;/code&gt; directory as &lt;code&gt;CSR.pem&lt;/code&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90631 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-5-7.png" alt="" width="1251" height="895"&gt;&lt;/p&gt; 
&lt;ol start="6"&gt; 
 &lt;li&gt;Sign the CSR issued by AWS Private CA with your root CA by running the &lt;code&gt;./sign-acm-ca.sh&lt;/code&gt; script provided in the code sample.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90630 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-6-4.png" alt="" width="1126" height="497"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; AWS Private CA retains the private key internally. You only sign its CSR and import the resulting certificate back into AWS Private CA.&lt;/p&gt; 
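&lt;p&gt;Before importing, you might want to confirm that the signed subordinate certificate chains to your root CA. The following is a minimal sketch, assuming the signing script wrote &lt;code&gt;acm-subordinate-ca-cert.pem&lt;/code&gt; and &lt;code&gt;acm-ca-chain.pem&lt;/code&gt; (the file names used in Step 3) to the current directory. The helper name &lt;code&gt;verify_subordinate_ca&lt;/code&gt; is hypothetical and isn’t part of the code sample.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-shell"&gt;# Hypothetical helper (not part of the sample repository).
# $1 = PEM file holding the root CA certificate (the chain)
# $2 = PEM file holding the signed subordinate CA certificate
verify_subordinate_ca() {
  chain_file="$1"
  cert_file="$2"
  # Show who the certificate was issued to and who signed it
  openssl x509 -in "$cert_file" -noout -subject -issuer || return 1
  # Check the signature chain against the root CA certificate
  openssl verify -CAfile "$chain_file" "$cert_file"
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;For example, run &lt;code&gt;verify_subordinate_ca acm-ca-chain.pem acm-subordinate-ca-cert.pem&lt;/code&gt;. A nonzero exit status suggests that the certificate wasn’t signed by the CA in the chain file.&lt;/p&gt; 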
&lt;h3&gt;&lt;strong&gt;Step 3: Import signed certificate to AWS ACM Private CA&lt;/strong&gt;&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Return to the AWS Private CA console.&lt;/li&gt; 
 &lt;li&gt;Select the CA that you created and select &lt;strong&gt;Install CA certificate&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="wp-image-90629 size-full alignnone" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-7-4.png" alt="" width="330" height="366"&gt;&lt;/p&gt; 
&lt;ol start="3"&gt; 
 &lt;li&gt;Select &lt;strong&gt;External private CA&lt;/strong&gt; as the CA type.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90628 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-8-4.png" alt="" width="1433" height="192"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Importing the certificate into AWS Certificate Manager&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90627 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-9-3.png" alt="" width="819" height="671"&gt;&lt;/p&gt; 
&lt;p&gt;Open both files in a text editor:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;code&gt;acm-subordinate-ca-cert.pem&lt;/code&gt;&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;acm-ca-chain.pem&lt;/code&gt;&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;In the &lt;strong&gt;Certificate body&lt;/strong&gt; field:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Copy the &lt;strong&gt;entire content&lt;/strong&gt; of the &lt;code&gt;acm-subordinate-ca-cert.pem&lt;/code&gt; file and paste it into the text box.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;In the &lt;strong&gt;Certificate chain&lt;/strong&gt; field:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Open the &lt;code&gt;acm-ca-chain.pem&lt;/code&gt; file. This file contains &lt;strong&gt;one certificate&lt;/strong&gt; (the root CA certificate).&lt;/li&gt; 
 &lt;li&gt;Copy the root CA certificate and paste it into the text box.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; The certificate chain shouldn’t include the subordinate CA certificate itself—only the certificates above it in the chain (the root CA).&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Choose &lt;strong&gt;Confirm and install&lt;/strong&gt; to complete the process.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;The AWS Private CA status should change to &lt;strong&gt;Active&lt;/strong&gt;.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90626 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-10-3.png" alt="" width="1430" height="367"&gt;&lt;/p&gt; 
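&lt;p&gt;If you prefer the command line, the same import can be sketched with the AWS CLI. This example assumes the AWS CLI is installed and configured, that both PEM files are in the current directory, and that you pass the ARN of the subordinate CA you created. The helper name &lt;code&gt;install_subordinate_ca_cert&lt;/code&gt; is hypothetical.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-shell"&gt;# Hypothetical helper: install the signed subordinate CA certificate.
# $1 = ARN of the subordinate CA created in AWS Private CA
install_subordinate_ca_cert() {
  ca_arn="$1"
  # The certificate body is the subordinate CA certificate;
  # the chain holds only the certificates above it (the root CA).
  aws acm-pca import-certificate-authority-certificate \
    --certificate-authority-arn "$ca_arn" \
    --certificate fileb://acm-subordinate-ca-cert.pem \
    --certificate-chain fileb://acm-ca-chain.pem || return 1
  # Print the CA status; ACTIVE indicates the certificate was installed
  aws acm-pca describe-certificate-authority \
    --certificate-authority-arn "$ca_arn" \
    --query "CertificateAuthority.Status" \
    --output text
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 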
&lt;h3&gt;Step 4: Configure your MSK cluster for Mutual TLS authentication&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;Select your MSK cluster, go to&amp;nbsp;&lt;strong&gt;Properties&lt;/strong&gt;&amp;nbsp;and edit the&amp;nbsp;&lt;strong&gt;Security settings&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Select&amp;nbsp;&lt;strong&gt;TLS client authentication through AWS Certificate Manager (ACM)&lt;/strong&gt;&amp;nbsp;as the access control method and choose the Subordinate CA that you created earlier. Then choose &lt;strong&gt;Save changes&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90625 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-11-3.png" alt="" width="798" height="485"&gt;&lt;/p&gt; 
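&lt;p&gt;The console steps above can also be approximated with the AWS CLI. The following is a sketch under stated assumptions: a provisioned MSK cluster, a configured AWS CLI, and a hypothetical helper name &lt;code&gt;enable_msk_mtls&lt;/code&gt;. Review your cluster’s existing authentication settings before running a security update, because the update takes time and applies to the whole cluster.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-shell"&gt;# Hypothetical helper: enable mutual TLS on an MSK cluster.
# $1 = MSK cluster ARN
# $2 = ARN of the subordinate CA in AWS Private CA
enable_msk_mtls() {
  cluster_arn="$1"
  ca_arn="$2"
  # update-security requires the cluster's current version string
  current_version=$(aws kafka describe-cluster \
    --cluster-arn "$cluster_arn" \
    --query "ClusterInfo.CurrentVersion" \
    --output text) || return 1
  # Associate the private CA and turn on TLS client authentication
  aws kafka update-security \
    --cluster-arn "$cluster_arn" \
    --current-version "$current_version" \
    --client-authentication "Tls={CertificateAuthorityArnList=[$ca_arn],Enabled=true}"
}&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 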
&lt;h3&gt;Step 5: Generate a test client certificate&lt;/h3&gt; 
&lt;p&gt;&lt;strong&gt;Run the certificate generation script&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Execute the following command, replacing &amp;lt;client-name&amp;gt; with a descriptive name for your client (this will be used in the certificate filename): &lt;code&gt;./generate-client-cert.sh &amp;lt;client-name&amp;gt;&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Example:&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;./generate-client-cert.sh kafka-admin&lt;/code&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Enter distinguished name information&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;When prompted, enter the distinguished name (DN) options. These should &lt;strong&gt;match your root CA settings&lt;/strong&gt; except for the Common Name (CN):&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Country (C):&lt;/strong&gt; Match your root CA (for example, US)&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;State (ST):&lt;/strong&gt; Match your root CA (for example, State)&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Organization (O):&lt;/strong&gt; Match your root CA (for example, Anycompany)&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Organizational Unit (OU):&lt;/strong&gt; Match your root CA (for example, IT)&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Common Name (CN):&lt;/strong&gt; Use a &lt;strong&gt;client-specific identifier&lt;/strong&gt; (for example, kafka-admin or client)&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Verify certificate files&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;After the certificate is generated, verify that the files were created successfully by running &lt;code&gt;ls ~/ca/certs&lt;/code&gt;. You should see files with your client name, including:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;code&gt;&amp;lt;client-name&amp;gt;.key&lt;/code&gt; (private key)&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;&amp;lt;client-name&amp;gt;.crt&lt;/code&gt; (certificate)&lt;/li&gt; 
 &lt;li&gt;&lt;code&gt;&amp;lt;client-name&amp;gt;.p12&lt;/code&gt; (PKCS12 keystore)&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Create Kafka client properties file&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Create a new properties file for your Kafka client (for example, &lt;code&gt;kafka-tls-client.properties&lt;/code&gt;) based on the provided &lt;code&gt;kafka-admin-ssl.properties&lt;/code&gt; example file. Update the file paths to reference your newly generated client certificate files.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Example configuration:&lt;/strong&gt;&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;security.protocol=SSL
# The generated keystore is a PKCS12 (.p12) file
ssl.keystore.type=PKCS12
ssl.keystore.location=/path/to/&amp;lt;client-name&amp;gt;.p12
ssl.keystore.password=your-keystore-password
# Omit ssl.key.password if you didn’t set a key password
ssl.key.password=your-key-password
ssl.keystore.alias=your-private-key-alias&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;h3&gt;Step 6: Testing the Kafka client connection&lt;/h3&gt; 
&lt;p&gt;To test the Kafka client connection, do the following.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Set environment variables&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;First, set the required environment variables for your Kafka installation and MSK cluster:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-shell"&gt;export KAFKA_HOME=/home/ec2-user/kafka
export BOOTSTRAP_SERVERS=&amp;lt;your-msk-bootstrap-servers&amp;gt;&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Replace &amp;lt;your-msk-bootstrap-servers&amp;gt; with your actual Amazon MSK cluster bootstrap server endpoints (for example, &lt;code&gt;b-1.mycluster.abc123.kafka.us-east-1.amazonaws.com:9094,b-2.mycluster.abc123.kafka.us-east-1.amazonaws.com:9094&lt;/code&gt;).&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Run the Kafka list topics command&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Execute the following command to verify that your client can successfully connect to Amazon MSK using mutual TLS authentication:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-code"&gt;$KAFKA_HOME/bin/kafka-topics.sh \
  --bootstrap-server $BOOTSTRAP_SERVERS \
  --list \
  --command-config kafka-tls-client.properties&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;What this test does:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Connects to your Amazon MSK cluster using the TLS configuration in your properties file&lt;/li&gt; 
 &lt;li&gt;Authenticates using your client certificate&lt;/li&gt; 
 &lt;li&gt;Lists all available Kafka topics&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Expected result:&lt;/strong&gt; If successful, you should see a list of topics in your Kafka cluster (or an empty list if no topics exist yet).&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90624 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-12-2.png" alt="" width="1284" height="243"&gt;&lt;/p&gt; 
&lt;p&gt;If the connection fails, check:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Your bootstrap server endpoints are correct&lt;/li&gt; 
 &lt;li&gt;You imported the private key and certificate chain into your keystore&lt;/li&gt; 
 &lt;li&gt;The paths in your properties file point to the correct keystore and truststore files&lt;/li&gt; 
 &lt;li&gt;Your client certificate was properly imported&lt;/li&gt; 
 &lt;li&gt;Your Amazon MSK cluster security settings allow TLS client authentication&lt;/li&gt; 
 &lt;li&gt;Your Amazon MSK cluster references the correct AWS Private CA ARN&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;Troubleshooting&lt;/h2&gt; 
&lt;h3&gt;Enable debug mode to verify certificate handshake&lt;/h3&gt; 
&lt;p&gt;To troubleshoot certificate issues and verify which certificates are involved in the TLS handshake, enable Java SSL debug mode:&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-shell"&gt;export KAFKA_OPTS="-Djavax.net.debug=ssl:handshake:verbose"
$KAFKA_HOME/bin/kafka-topics.sh \
  --bootstrap-server $BOOTSTRAP_SERVERS \
  --list \
  --command-config kafka-tls-client.properties&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;What this debug mode shows:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;The complete TLS handshake process&lt;/li&gt; 
 &lt;li&gt;Which certificates are being presented by both client and server&lt;/li&gt; 
 &lt;li&gt;The certificate chain validation steps&lt;/li&gt; 
 &lt;li&gt;Which certificate from your truststore is being used for authentication&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;img loading="lazy" class="aligncenter wp-image-90623 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/27/image-13-3.png" alt="" width="1270" height="474"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;When this is helpful:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;When you have multiple certificates in your truststore and need to identify which one is being used&lt;/li&gt; 
 &lt;li&gt;When troubleshooting certificate chain validation issues&lt;/li&gt; 
 &lt;li&gt;When verifying that the correct client certificate is being presented during authentication&lt;/li&gt; 
 &lt;li&gt;When diagnosing certificate mismatch or trust issues&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h3&gt;&lt;strong&gt;Reading the debug output:&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Look for lines containing:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;***Certificate chain – Shows the certificates being presented&lt;/li&gt; 
 &lt;li&gt;Found trusted certificate – Indicates which certificate in your truststore matched&lt;/li&gt; 
 &lt;li&gt;Cert path validation – Shows the certificate chain validation process&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;To disable debug mode&lt;/strong&gt; after troubleshooting, simply unset the environment variable:&lt;/p&gt; 
&lt;p&gt;&lt;code&gt;unset KAFKA_OPTS&lt;/code&gt;&lt;/p&gt; 
&lt;h2&gt;Conclusion&lt;/h2&gt; 
&lt;p&gt;This post presents a solution for migrating TLS clients from self-managed Apache Kafka to Amazon MSK while reusing existing third-party CA-signed certificates. The approach removes the need for certificate reissuance by instead creating an intermediate CA from the existing third-party CA, importing it into AWS Certificate Manager as a Private CA, and integrating it with Amazon MSK. This maintains the established chain of trust through the common certificate authority, enabling seamless migration without operational disruption while preserving the existing security architecture and mTLS implementation. To read more about the Amazon MSK security model, see &lt;a href="https://docs.aws.amazon.com/msk/latest/developerguide/security.html" target="_blank" rel="noopener noreferrer"&gt;Security in Amazon MSK&lt;/a&gt;.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-82865 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2025/08/31/BDB-4572-Ali-Alemi.png" alt="Author Ali Alemi" width="100" height="100"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Ali Alemi&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/ali-alemi-11869b53/" target="_blank" rel="noopener"&gt;Ali&lt;/a&gt; is a Principal Streaming Solutions Architect at AWS. Ali advises AWS customers on architectural best practices and helps them design real-time analytics data systems that are reliable, secure, efficient, and cost-effective. Prior to joining AWS, Ali supported several public sector customers and AWS consulting partners in their application modernization journey and migration to the Cloud.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="size-full wp-image-62362 alignleft" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2024/04/10/swapnaba-pic-2.jpeg" alt="" width="81" height="108"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Swapna Bandla&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/swapnabandla/" target="_blank" rel="noopener"&gt;Swapna&lt;/a&gt; is a Senior Streaming Solutions Architect at AWS. With a deep understanding of real-time data processing and analytics, she partners with customers to architect scalable, cloud-native solutions that align with AWS Well-Architected best practices. Swapna is passionate about helping organizations unlock the full potential of their data to drive business value. Beyond her professional pursuits, she cherishes quality time with her family.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>A guide to capacity planning for Airflow worker pool in Amazon MWAA</title>
		<link>https://aws.amazon.com/blogs/big-data/a-guide-to-capacity-planning-for-airflow-worker-pool-in-amazon-mwaa/</link>
		
		<dc:creator><![CDATA[Boyko Radulov]]></dc:creator>
		<pubDate>Fri, 01 May 2026 15:42:45 +0000</pubDate>
				<category><![CDATA[Amazon Managed Workflows for Apache Airflow (Amazon MWAA)]]></category>
		<category><![CDATA[Best Practices]]></category>
		<category><![CDATA[Amazon Cloudwatch]]></category>
		<guid isPermaLink="false">250b8508de241c170b26475c4624ddc662cfa423</guid>

					<description>In our previous post, A guide to Airflow worker pool optimization in Amazon MWAA, we explored when adding workers to your Amazon Managed Workflows for Apache Airflow (Amazon MWAA) environment actually solves performance issues, and when it doesn’t. We walked through patterns like high CPU utilization and long queue times where scaling may be appropriate, […]</description>
										<content:encoded>&lt;p&gt;In our previous post, &lt;a href="https://aws.amazon.com/blogs/big-data/a-guide-to-airflow-worker-pool-optimization-in-amazon-mwaa/"&gt;A guide to Airflow worker pool optimization in Amazon MWAA&lt;/a&gt;, we explored when adding workers to your Amazon Managed Workflows for Apache Airflow (Amazon MWAA) environment actually solves performance issues, and when it doesn’t. We walked through patterns like high CPU utilization and long queue times where scaling may be appropriate, and anti-patterns like misconfigured Airflow settings and memory leaks where adding workers only masks the real problem. The key takeaway was clear: optimize first, scale second, and always let data drive the decision.&lt;/p&gt; 
&lt;p&gt;But what happens after you’ve done the optimization work? Your DAGs are efficient, your configurations are tuned, and your environment is running well. Then the business comes knocking: new regulatory requirements, additional data pipelines, expanded reporting. The workload is about to grow, and this time, you genuinely need more capacity.&lt;/p&gt; 
&lt;p&gt;This is where capacity planning comes in. Knowing how many workers to provision, before the new workload hits production, is the difference between a smooth rollout and a 5 AM SLA breach. In this post, we walk through a practical capacity planning framework for Amazon MWAA worker pools. Using a real-world financial services scenario, we show how to assess your current capacity, project future needs, calculate the right number of base workers, and set up monitoring to keep your environment healthy as workloads evolve.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Scenario:&lt;/strong&gt; A financial services company needs to plan capacity for a 25% increase in directed acyclic graphs (DAGs) to support new regulatory reporting requirements.&lt;/p&gt; 
&lt;h2&gt;&lt;strong&gt;Current vs projected state&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;The following table compares the current and expected state after adding 25% more DAGs.&lt;/p&gt; 
&lt;p&gt;&amp;nbsp;&lt;/p&gt; 
&lt;table border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;/td&gt; 
   &lt;td&gt;Metric&lt;/td&gt; 
   &lt;td&gt;Current&lt;/td&gt; 
   &lt;td&gt;Projected&lt;/td&gt; 
   &lt;td&gt;Change&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;1&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;DAGs&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;20&lt;/td&gt; 
   &lt;td&gt;25&lt;/td&gt; 
   &lt;td&gt;+25% (+5 DAGs)&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;2&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Peak Tasks (5-7 AM)&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;80&lt;/td&gt; 
   &lt;td&gt;104&lt;/td&gt; 
   &lt;td&gt;+24 tasks&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;3&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Environment Class&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;mw1.medium&lt;/td&gt; 
   &lt;td&gt;mw1.medium&lt;/td&gt; 
   &lt;td&gt;No change&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;4&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Base Workers&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;8&lt;/td&gt; 
   &lt;td&gt;11&lt;/td&gt; 
   &lt;td&gt;+3 workers&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;5&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Tasks per Worker&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;10 (mw1.medium default)&lt;/td&gt; 
   &lt;td&gt;10&lt;/td&gt; 
   &lt;td&gt;No change&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;6&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Available Capacity&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;80 slots (8 × 10)&lt;/td&gt; 
   &lt;td&gt;110 slots (11 × 10)&lt;/td&gt; 
   &lt;td&gt;+30 slots&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;7&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Peak Utilization&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;100% (80/80 slots) &lt;img src="https://s.w.org/images/core/emoji/14.0.0/72x72/26a0.png" alt="⚠" class="wp-smiley" style="height: 1em; max-height: 1em;"&gt;&lt;/td&gt; 
   &lt;td&gt;95% (104/110 slots)&lt;/td&gt; 
   &lt;td&gt;Improved&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;8&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Critical SLA&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;7 AM market open&lt;/td&gt; 
   &lt;td&gt;7 AM market open&lt;/td&gt; 
   &lt;td&gt;No tolerance&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;p&gt;&lt;strong&gt;Capacity planning goal:&lt;/strong&gt; Reduce utilization from 100% to 95% to maintain service level agreement (SLA) compliance and handle unexpected spikes.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Understanding current capacity:&lt;/strong&gt; The environment currently runs 8 base workers, providing 80 concurrent task slots (8 workers × 10 tasks per worker). During the 5-7 AM peak with 80 concurrent tasks, this represents 100% utilization, a risky level that leaves no headroom for unexpected spikes or volatility.&lt;br&gt; With the planned addition of 5 new regulatory reporting DAGs, peak concurrent tasks will grow to 104. To maintain healthy operations with adequate buffer, we need to increase to 11 base workers (110 slots), resulting in 95% peak utilization with 6 slots of breathing room.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Why 100% utilization is risky: &lt;/strong&gt;Running at 100% task utilization means:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Zero buffer for unexpected spikes&lt;/li&gt; 
 &lt;li&gt;Any additional task causes immediate queuing&lt;/li&gt; 
 &lt;li&gt;No room for market volatility or data volume increases&lt;/li&gt; 
 &lt;li&gt;High risk of SLA breaches during unpredictable events&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;Best practice: Maintain 5-15% headroom (85-95% utilization) for production workloads with critical SLAs.&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Why this sizing:&lt;/strong&gt;&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Current:&lt;/strong&gt; 80 tasks ÷ 80 slots = 100% utilization (at capacity – risky!)&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Projected:&lt;/strong&gt; 104 tasks ÷ 110 slots = 95% utilization (healthy with buffer)&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Buffer:&lt;/strong&gt; 6 slots (5% headroom) protects against unexpected volatility spikes&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;SLA protection:&lt;/strong&gt; Adequate headroom prevents queuing during normal operations&lt;/li&gt; 
&lt;/ul&gt; 
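&lt;p&gt;The sizing arithmetic above can be sketched in a few lines of Python. This is a minimal illustration rather than an Amazon MWAA API: the function name &lt;code&gt;peak_utilization&lt;/code&gt; and its parameters are hypothetical, and the inputs are the numbers from this scenario.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;def peak_utilization(peak_tasks, workers, tasks_per_worker=10):
    """Return (utilization percent, free slots) at peak.

    tasks_per_worker defaults to 10, the mw1.medium value used in this post.
    The function name and signature are hypothetical.
    """
    slots = workers * tasks_per_worker
    return 100 * peak_tasks / slots, slots - peak_tasks


# Current state: 80 peak tasks on 8 base workers
current_pct, current_free = peak_utilization(80, 8)
# Projected state: 104 peak tasks on 11 base workers
projected_pct, projected_free = peak_utilization(104, 11)

print(f"current:   {current_pct:.0f}% used, {current_free} free slots")
print(f"projected: {projected_pct:.0f}% used, {projected_free} free slots")&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Running the same function with your own peak task count and worker count shows how close to the 85-95% target range a given configuration would sit.&lt;/p&gt; 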
&lt;h2&gt;&lt;strong&gt;Capacity analysis&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Every team asks the same critical question: &lt;strong&gt;“How many workers do I need?”&lt;/strong&gt; The process is to identify your peak concurrent tasks from &lt;a href="https://docs.aws.amazon.com/mwaa/latest/userguide/accessing-metrics-cw-container-queue-db.html" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch metrics&lt;/a&gt;, divide by your environment’s tasks-per-worker capacity, and add a 5-15% safety buffer.&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Step 1: Identifying peak concurrent tasks from Amazon CloudWatch&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;To determine your peak workload, analyze the &lt;code&gt;RunningTasks&lt;/code&gt; and &lt;code&gt;QueuedTasks&lt;/code&gt; CloudWatch metrics for your Amazon MWAA environment. Navigate to Amazon CloudWatch and query the following key metrics:&lt;/p&gt; 
&lt;h4&gt;&lt;strong&gt;Primary metrics for capacity planning:&lt;/strong&gt;&lt;/h4&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;RunningTasks:&lt;/strong&gt; Number of tasks currently executing across all workers. This shows your actual concurrent task load.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;QueuedTasks:&lt;/strong&gt; Number of tasks waiting for available worker slots. High values indicate insufficient capacity.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;AvailableWorkers:&lt;/strong&gt; Current number of active workers in your environment.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;&lt;strong&gt;How to find peak concurrent tasks:&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Open the Amazon CloudWatch Console. 
  &lt;ul&gt; 
   &lt;li&gt;Choose &lt;strong&gt;Metrics&lt;/strong&gt;.&lt;/li&gt; 
   &lt;li&gt;Choose the &lt;strong&gt;MWAA &lt;/strong&gt;namespace.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;Select your environment name.&lt;/li&gt; 
 &lt;li&gt;Add the &lt;code&gt;RunningTasks&lt;/code&gt; metric.&lt;/li&gt; 
 &lt;li&gt;Set the time range to the last 7-30 days.&lt;/li&gt; 
 &lt;li&gt;Change the statistic to &lt;strong&gt;Maximum&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Identify the highest value during your peak hours (for example, 5-7 AM).&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;strong&gt;Example query:&lt;/strong&gt;&lt;br&gt; Note: The following query is conceptual and is not valid Amazon CloudWatch query syntax. For more information, refer to &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/query_with_cloudwatch-metrics-insights.html" target="_blank" rel="noopener noreferrer"&gt;Query your CloudWatch metrics with CloudWatch Metrics Insights&lt;/a&gt;.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-sql"&gt;SELECT MAX(RunningTasks) AS PeakConcurrentTasks
FROM MWAA_Metrics
WHERE Environment = 'prod-airflow'
  AND timestamp BETWEEN '2024-10-01' AND '2024-10-31'
  AND HOUR(timestamp) BETWEEN 5 AND 6;  -- hours 5 and 6 cover 05:00 through 06:59&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;In our scenario, this analysis revealed &lt;strong&gt;80 concurrent tasks&lt;/strong&gt; during the 5-7 AM window. With the planned 25% DAG increase, we project this will grow to &lt;strong&gt;104 concurrent tasks&lt;/strong&gt;.&lt;/p&gt; 
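&lt;p&gt;If you export the metric instead of reading it from the console, the same peak lookup can be sketched in Python. This example assumes datapoints shaped like the &lt;code&gt;Datapoints&lt;/code&gt; list that the CloudWatch GetMetricStatistics API returns when the Maximum statistic is requested, with timestamps already converted to the time zone in which the 5-7 AM window is defined. The helper name &lt;code&gt;peak_in_window&lt;/code&gt; and the sample values are hypothetical, and fetching the data is left out.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;from datetime import datetime


def peak_in_window(datapoints, start_hour=5, end_hour=7):
    """Return the highest 'Maximum' value whose timestamp falls in
    [start_hour, end_hour), or None if no datapoint is in the window."""
    in_window = [
        dp["Maximum"]
        for dp in datapoints
        if dp["Timestamp"].hour in range(start_hour, end_hour)
    ]
    return max(in_window, default=None)


# Hypothetical sample data, for illustration only
sample = [
    {"Timestamp": datetime(2024, 10, 1, 4, 55), "Maximum": 35.0},
    {"Timestamp": datetime(2024, 10, 1, 5, 30), "Maximum": 72.0},
    {"Timestamp": datetime(2024, 10, 1, 6, 45), "Maximum": 80.0},
    {"Timestamp": datetime(2024, 10, 1, 7, 5), "Maximum": 41.0},
]
print(peak_in_window(sample))&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The window is half-open, so a datapoint stamped 7:05 AM is excluded even though it is close to the SLA deadline.&lt;/p&gt; 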
&lt;h3&gt;Step 2: Calculate required workers&lt;/h3&gt; 
&lt;p&gt;To calculate the number of workers required to run the peak workload without queuing, use the following formula and round the result up to a whole worker: &lt;strong&gt;Peak concurrent tasks ÷ Tasks per worker × Safety buffer = Required workers&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;In the projected scenario, with 104 tasks at peak hours, an mw1.medium environment with the default concurrency configuration, and a 5% safety buffer, we need 11 workers:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;104 peak tasks ÷ 10 tasks per worker × 1.05 buffer = 10.92, rounded up to 11 workers required to handle your workload without queuing during the busiest periods.&lt;/li&gt; 
&lt;/ul&gt; 
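&lt;p&gt;The calculation above can be expressed as a small Python helper. This is a sketch under two assumptions: the function name &lt;code&gt;required_workers&lt;/code&gt; is hypothetical, and the result is rounded up because a fraction of a worker cannot be provisioned.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;import math


def required_workers(peak_tasks, tasks_per_worker, buffer_pct):
    """Peak tasks / tasks per worker x (1 + buffer), rounded up.

    buffer_pct is a whole-number percentage, for example 5 for a 5% buffer.
    Rounding up is an assumption of this sketch.
    """
    # Keep the numerator an integer so the only floating-point step is the division
    return math.ceil(peak_tasks * (100 + buffer_pct) / (100 * tasks_per_worker))


# Scenario from this post: 104 peak tasks, mw1.medium (10 tasks per worker), 5% buffer
print(required_workers(104, 10, 5))&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Because of the rounding, the answer is sensitive to the buffer you choose near a whole-worker boundary, so it is worth checking both ends of the 5-15% range before committing to a number.&lt;/p&gt; 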
&lt;h2&gt;Capacity monitoring and triggers&lt;/h2&gt; 
&lt;p&gt;There are a few important Amazon CloudWatch metrics to monitor for environment health.&lt;/p&gt; 
&lt;h3&gt;Key metrics to monitor&lt;/h3&gt; 
&lt;p&gt;Monitor these five critical Amazon CloudWatch metrics to detect capacity issues:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;QueuedTasks (&amp;gt;10 for &amp;gt;5 minutes indicates insufficient capacity)&lt;/li&gt; 
 &lt;li&gt;RunningTasks (consistently at maximum suggests the need for more workers)&lt;/li&gt; 
 &lt;li&gt;AdditionalWorkers (active for more than 6 hours daily signals the permanent worker problem, where automatically scaled workers are effectively serving as base capacity)&lt;/li&gt; 
 &lt;li&gt;Worker CPU (&amp;gt;85% sustained requires an environment class upgrade or workload optimization)&lt;/li&gt; 
 &lt;li&gt;Task Duration (+15% increase means reduced effective capacity per worker)&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;These metrics provide early warning signals to adjust capacity before SLA breaches occur.&lt;/p&gt; 
&lt;p&gt;&amp;nbsp;&lt;/p&gt; 
&lt;table border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;/td&gt; 
   &lt;td&gt;Metric&lt;/td&gt; 
   &lt;td&gt;Threshold&lt;/td&gt; 
   &lt;td&gt;Action&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;1&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;QueuedTasks&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&amp;gt;10 for &amp;gt;5 minutes&lt;/td&gt; 
   &lt;td&gt;Investigate capacity&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;2&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;RunningTasks&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;Consistently at max&lt;/td&gt; 
   &lt;td&gt;Increase base workers&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;3&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;AdditionalWorkers&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;Active &amp;gt;6 hours daily&lt;/td&gt; 
   &lt;td&gt;Increase base workers&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;4&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Worker CPU&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&amp;gt;85% sustained&lt;/td&gt; 
   &lt;td&gt;Upgrade environment class&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;5&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Task Duration&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;+15% increase&lt;/td&gt; 
   &lt;td&gt;Review capacity per worker&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;h3&gt;Amazon CloudWatch monitoring queries&lt;/h3&gt; 
&lt;p&gt;Note: The following queries are conceptual and are not valid Amazon CloudWatch query syntax. For more information, refer to &lt;a href="https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/query_with_cloudwatch-metrics-insights.html" target="_blank" rel="noopener noreferrer"&gt;Query your CloudWatch metrics with CloudWatch Metrics Insights&lt;/a&gt;.&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;Queue depth during peak hours 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="lang-sql"&gt;SELECT AVG(QueuedTasks)
FROM MWAA_Metrics
WHERE Environment = 'prod-airflow'
  AND HOUR(timestamp) BETWEEN 5 AND 6  -- 05:00 through 06:59
GROUP BY 5m;&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;/li&gt; 
 &lt;li&gt;Worker utilization efficiency 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="lang-sql"&gt;-- 10 = default tasks per worker on mw1.medium; adjust for your environment class
SELECT AVG(RunningTasks) / AVG(AvailableWorkers * 10) * 100 AS UtilizationPercent
FROM MWAA_Metrics
WHERE Environment = 'prod-airflow';&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;/li&gt; 
 &lt;li&gt;Detect permanent worker problem 
  &lt;div class="hide-language"&gt; 
   &lt;pre&gt;&lt;code class="lang-sql"&gt;SELECT DATE(timestamp) AS date,
       AVG(AdditionalWorkers) AS avg_additional,
       MAX(AdditionalWorkers) AS max_additional
FROM MWAA_Metrics
WHERE AdditionalWorkers &amp;gt; 0
GROUP BY DATE(timestamp)
HAVING AVG(AdditionalWorkers) &amp;gt; 5;&lt;/code&gt;&lt;/pre&gt; 
  &lt;/div&gt; &lt;/li&gt; 
&lt;/ul&gt; 
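&lt;p&gt;For the permanent worker check, the “active more than 6 hours daily” threshold can be evaluated from one day of exported samples. The following Python sketch assumes one AdditionalWorkers value per 5-minute period; the function name &lt;code&gt;additional_worker_hours&lt;/code&gt;, the period length, and the sample data are all hypothetical.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;def additional_worker_hours(samples, period_minutes=5):
    """Hours in which at least one additional worker was active.

    samples: AdditionalWorkers values for a single day, one per period.
    The 5-minute period is an assumption of this sketch.
    """
    active_periods = sum(1 for value in samples if value)
    return active_periods * period_minutes / 60


# Hypothetical day: 198 idle periods followed by 90 periods with 2 extra workers
day = [0] * 198 + [2] * 90
print(additional_worker_hours(day))&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Compare the result with the 6-hour threshold from the table above. A value that stays above it day after day suggests that the base worker count is too low.&lt;/p&gt; 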
&lt;h3&gt;&lt;strong&gt;Setting up alerts&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;You can configure these alarms to catch problems as soon as they appear.&lt;/p&gt; 
&lt;h4&gt;&lt;strong&gt;Recommended Amazon CloudWatch alarms:&lt;/strong&gt;&lt;/h4&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;High queue depth alert&lt;/strong&gt; 
  &lt;ul&gt; 
   &lt;li&gt;Metric: QueuedTasks&lt;/li&gt; 
   &lt;li&gt;Threshold: &amp;gt; 10 for 2 consecutive 5-minute periods&lt;/li&gt; 
   &lt;li&gt;Action: Notify operations team&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Permanent worker detection&lt;/strong&gt; 
  &lt;ul&gt; 
   &lt;li&gt;Metric: AdditionalWorkers&lt;/li&gt; 
   &lt;li&gt;Threshold: &amp;gt; 0 for 6+ hours&lt;/li&gt; 
   &lt;li&gt;Action: Review capacity planning&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;SLA risk alert&lt;/strong&gt; 
  &lt;ul&gt; 
   &lt;li&gt;Metric: QueuedTasks during 5-7 AM window&lt;/li&gt; 
   &lt;li&gt;Threshold: &amp;gt; 5 tasks&lt;/li&gt; 
   &lt;li&gt;Action: Page on-call engineer&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ol&gt; 
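&lt;p&gt;As a starting point, the three alarms can be described as parameter sets in the shape that the CloudWatch PutMetricAlarm API accepts. This is a sketch, not a tested deployment: the helper name &lt;code&gt;alarm_params&lt;/code&gt;, the alarm names, and the &lt;code&gt;Environment&lt;/code&gt; dimension are assumptions that you should verify against the metrics in your own account. It also does not restrict the SLA risk alarm to the 5-7 AM window.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;def alarm_params(name, metric, threshold, periods, env="prod-airflow"):
    """Build keyword arguments in the shape of CloudWatch PutMetricAlarm."""
    return {
        "AlarmName": f"{env}-{name}",
        "Namespace": "AWS/MWAA",  # assumption: confirm the namespace in your account
        "MetricName": metric,
        "Dimensions": [{"Name": "Environment", "Value": env}],  # assumption
        "Statistic": "Maximum",
        "Period": 300,  # 5-minute periods
        "EvaluationPeriods": periods,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
    }


alarms = [
    alarm_params("high-queue-depth", "QueuedTasks", 10, 2),
    alarm_params("permanent-workers", "AdditionalWorkers", 0, 72),  # 72 x 5 min = 6 hours
    alarm_params("sla-risk", "QueuedTasks", 5, 1),
]

for params in alarms:
    print(params["AlarmName"], params["MetricName"], params["Threshold"])&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;Each dictionary could then be passed to an SDK call such as &lt;code&gt;boto3.client("cloudwatch").put_metric_alarm(**params)&lt;/code&gt;, after adding an &lt;code&gt;AlarmActions&lt;/code&gt; entry that points to the notification target for the operations team or the on-call engineer.&lt;/p&gt; 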
&lt;h3&gt;&lt;strong&gt;When to revisit capacity planning&lt;/strong&gt;&lt;/h3&gt; 
&lt;p&gt;Conduct quarterly scheduled reviews to analyze trends and project growth. Also run immediate trigger-based assessments when:&lt;/p&gt; 
&lt;ul&gt; 
 &lt;li&gt;DAG count increases &amp;gt;10% (or more than your safety buffer)&lt;/li&gt; 
 &lt;li&gt;Performance degrades&lt;/li&gt; 
 &lt;li&gt;Cost anomalies appear (indicating permanent workers)&lt;/li&gt; 
 &lt;li&gt;Any SLA breach occurs.&lt;/li&gt; 
&lt;/ul&gt; 
&lt;p&gt;This dual approach provides proactive capacity management while enabling rapid response to emerging issues.&lt;/p&gt; 
&lt;p&gt;&amp;nbsp;&lt;/p&gt; 
&lt;table border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;/td&gt; 
   &lt;td&gt;Trigger&lt;/td&gt; 
   &lt;td&gt;Frequency&lt;/td&gt; 
   &lt;td&gt;Action&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;1&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Scheduled Review&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;Quarterly&lt;/td&gt; 
   &lt;td&gt;Analyze trends, project growth&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;2&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;DAG Growth&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&amp;gt;10% increase&lt;/td&gt; 
   &lt;td&gt;Recalculate capacity needs&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;3&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Performance Degradation&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;As observed&lt;/td&gt; 
   &lt;td&gt;Immediate capacity assessment&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;4&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Cost Anomalies&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;Monthly&lt;/td&gt; 
   &lt;td&gt;Check for permanent workers&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;5&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;SLA Breaches&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;Any occurrence&lt;/td&gt; 
   &lt;td&gt;Emergency capacity review&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
&lt;h2&gt;&lt;strong&gt;Decision matrix&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;The framework presents three capacity planning approaches, each optimized for different organizational priorities.&lt;/p&gt; 
&lt;p&gt;The &lt;strong&gt;Full Base Worker Provisioning strategy&lt;/strong&gt; (the conservative path) sets base workers equal to the calculated requirement, eliminating queue times during peak periods and guaranteeing SLA compliance with predictable fixed costs, while automatic scaling handles only unexpected spikes—ideal for mission-critical workloads with strict SLA requirements.&lt;/p&gt; 
&lt;p&gt;The &lt;strong&gt;Minimal Base + Automatic Scaling approach&lt;/strong&gt; (the cost-focused path) keeps base workers at current levels and relies heavily on automatic scaling. It accepts 3-5 minute delays during peak periods and a high risk of SLA breaches in exchange for lower baseline costs, and it requires intensive monitoring.&lt;/p&gt; 
&lt;p&gt;The &lt;strong&gt;Hybrid Approach&lt;/strong&gt; (the balanced path) provisions base workers at 80% of the calculated requirement with automatic scaling covering the remaining 20%, resulting in 2-3 minute delays during spikes while balancing cost against performance—suitable for moderate SLA requirements with some budget constraints.&lt;/p&gt; 
&lt;p&gt;The following figure compares the three approaches by queue time (under 30 seconds for full provisioning, 2-3 minutes for hybrid, and 3-5 minutes for minimal base), SLA compliance (guaranteed, high probability, and at risk during peak, respectively), and ideal use case (mission-critical predictable workloads, moderate SLA requirements with budget constraints, and development environments with flexible SLA tolerance). Use it to align provisioning decisions with your operational requirements and financial constraints.&lt;/p&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-1.jpeg"&gt;&lt;/p&gt; 
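&lt;p&gt;To make the three approaches concrete, the following Python sketch derives a base worker count for each one from the calculated requirement. The function name &lt;code&gt;base_workers_by_strategy&lt;/code&gt; and the dictionary keys are hypothetical, and rounding the hybrid value up to a whole worker is an assumption.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;import math


def base_workers_by_strategy(required, current):
    """Base worker count for each approach described above."""
    return {
        # Full provisioning: base workers equal the calculated requirement
        "full": required,
        # Hybrid: 80% as base workers (rounded up, an assumption of this sketch)
        "hybrid": math.ceil(required * 80 / 100),
        # Minimal base: keep the current count and rely on automatic scaling
        "minimal": current,
    }


# Scenario from this post: 11 workers required, 8 currently provisioned
print(base_workers_by_strategy(required=11, current=8))&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;For this scenario, the result is 11, 9, and 8 base workers for the full, hybrid, and minimal approaches.&lt;/p&gt; 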
&lt;h2&gt;&lt;strong&gt;Key takeaway&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Effective capacity planning prevents both under-provisioning (SLA breaches) and over-provisioning (cost overruns).&lt;/p&gt; 
&lt;h3&gt;&lt;strong&gt;Capacity planning principles&lt;/strong&gt;&lt;/h3&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Calculate capacity needs BEFORE adding workload&lt;/strong&gt; – Use peak task projections with 5-15% safety buffer&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Size minimum workers for peak demand&lt;/strong&gt; – Don’t rely on automatic scaling for predictable loads&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Use automatic scaling only for unexpected spikes&lt;/strong&gt; – Treat as safety net, not primary capacity&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Target 85-95% utilization during peak hours&lt;/strong&gt; – Ensures headroom for unexpected growth&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Plan 5-15% headroom for unexpected growth&lt;/strong&gt; – Production often differs from testing&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Monitor AdditionalWorkers metric&lt;/strong&gt; – If active &amp;gt;6 hours daily, increase base workers&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Review quarterly + trigger-based assessments&lt;/strong&gt; – Regular reviews plus immediate action on issues&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Balance cost and performance based on SLA criticality&lt;/strong&gt; – Business impact justifies infrastructure investment&lt;/li&gt; 
&lt;/ol&gt; 
&lt;h3&gt;&lt;strong&gt;Success metrics&lt;/strong&gt;&lt;/h3&gt; 
&lt;ul&gt; 
 &lt;li&gt;&lt;strong&gt;Queue efficiency:&lt;/strong&gt; Average queue time &amp;lt;30 seconds during peak&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;SLA compliance:&lt;/strong&gt; &amp;gt;99.5% of critical tasks complete on time&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Resource utilization:&lt;/strong&gt; 85-95% during peak hours (optimal efficiency)&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Cost predictability:&lt;/strong&gt; &amp;lt;10% variance in monthly worker costs&lt;/li&gt; 
&lt;/ul&gt; 
&lt;h2&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/h2&gt; 
&lt;p&gt;Capacity planning is not a one-time exercise. It’s an ongoing discipline. The framework we’ve outlined gives you a repeatable process: measure your current peak utilization through CloudWatch metrics, project growth based on incoming workloads, calculate the required workers with an appropriate safety buffer, and monitor continuously to catch drift before it becomes an outage.&lt;/p&gt; 
&lt;p&gt;The financial services scenario in this post illustrates a common reality: running at 100% utilization during peak hours leaves zero room for the unexpected. By sizing to 95% peak utilization with a modest buffer, the team gained the headroom needed to absorb volatility without risking their 7 AM market-open SLA.&lt;/p&gt; 
&lt;p&gt;Whether you choose full base worker provisioning for mission-critical pipelines, a hybrid approach for moderate SLA requirements, or lean on automatic scaling for development workloads, the right strategy depends on your business context, not a one-size-fits-all rule. Pair your capacity plan with the CloudWatch alarms and review triggers we covered, and you’ll catch capacity gaps early.&lt;/p&gt; 
&lt;p&gt;Combined with the optimization-first approach from Part 1, you now have a complete toolkit: diagnose before you scale, optimize before you provision, and plan before you deploy. Your MWAA environment and your on-call engineers will thank you.&lt;/p&gt; 
&lt;p&gt;To get started, visit the &lt;a href="https://aws.amazon.com/managed-workflows-for-apache-airflow/" target="_blank" rel="noopener noreferrer"&gt;Amazon MWAA product page&lt;/a&gt; and the &lt;a href="https://console.aws.amazon.com/mwaa/home" target="_blank" rel="noopener noreferrer"&gt;Amazon MWAA console page&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;If you have questions or want to share your MWAA capacity planning experience, leave a comment.&lt;/p&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-2.jpeg" alt="Boyko Radulov" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Boyko Radulov&lt;/h3&gt; 
  &lt;p&gt;Boyko is a Senior Cloud Support Engineer at Amazon Web Services (AWS) and an Amazon MWAA and AWS Glue Subject Matter Expert. He works closely with customers to build and optimize their workloads on AWS while reducing the overall cost. Beyond work, he is passionate about sports and travelling.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-3.jpeg" alt="Kamen Sharlandjiev" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Kamen Sharlandjiev&lt;/h3&gt; 
  &lt;p&gt;Kamen is a Principal Big Data and ETL Solutions Architect, Amazon MWAA and AWS Glue ETL expert. He’s on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on &lt;a href="https://www.linkedin.com/in/ksharlandjiev/" target="_blank" rel="noopener noreferrer"&gt;&lt;em&gt;LinkedIn&lt;/em&gt;&lt;/a&gt; to keep up to date with the latest Amazon MWAA and AWS Glue features and news.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-4.jpeg" alt="Venu Thangalapally" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Venu Thangalapally&lt;/h3&gt; 
  &lt;p&gt;Venu is a Senior Solutions Architect at AWS, based in Chicago, with deep expertise in cloud architecture, data and analytics, containers, and application modernization. He partners with financial service industry customers to translate business goals into secure, scalable, and compliant cloud solutions that deliver measurable value. Venu is passionate about using technology to drive innovation and operational excellence.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-5.jpeg" alt="Harshawardhan Kulkarni" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Harshawardhan Kulkarni&lt;/h3&gt; 
  &lt;p&gt;Harshawardhan is a Partner Technical Account Manager at AWS and an Amazon MWAA Subject Matter Expert. Based in Dublin, Ireland, he partners with enterprise customers across EMEA to help them navigate complex workflows and orchestration challenges while ensuring best practice implementation. Outside of work, he enjoys traveling and spending time with his family.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-6.jpeg" alt="Andrew McKenzie" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Andrew McKenzie&lt;/h3&gt; 
  &lt;p&gt;Andrew is a Data Engineer and Educator who draws on deep technical expertise from his time at AWS. As a former Amazon MWAA Subject Matter Expert, he now focuses on building data solutions and teaching data engineering best practices.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
		
		
			</item>
		<item>
		<title>A guide to Airflow worker pool optimization in Amazon MWAA</title>
		<link>https://aws.amazon.com/blogs/big-data/a-guide-to-airflow-worker-pool-optimization-in-amazon-mwaa/</link>
					
		
		<dc:creator><![CDATA[Boyko Radulov]]></dc:creator>
		<pubDate>Fri, 01 May 2026 15:41:26 +0000</pubDate>
				<category><![CDATA[Best Practices]]></category>
		<category><![CDATA[Amazon EMR]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[AWS Glue]]></category>
		<category><![CDATA[Big Data]]></category>
		<category><![CDATA[EMR]]></category>
		<guid isPermaLink="false">5a370db59f8f679e52ac60136a4bcab3d33ca08d</guid>

					<description>Optimizing the Airflow worker pool configuration in Amazon Managed Workflows for Apache Airflow (Amazon MWAA), the AWS fully managed Apache Airflow service, is an important yet often overlooked strategy for scaling workflow operations. Tasks queued for longer periods can create the illusion that additional workers are the solution, when in reality the root cause might […]</description>
										<content:encoded>&lt;p&gt;Optimizing the Airflow worker pool configuration in &lt;a href="http://aws.amazon.com/managed-workflows-for-apache-airflow" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Workflows for Apache Airflow&lt;/a&gt; (Amazon MWAA), the AWS fully managed Apache Airflow service, is an important yet often overlooked strategy for scaling workflow operations. Tasks queued for longer periods can create the illusion that additional workers are the solution, when in reality the root cause might lie elsewhere. The decision to scale isn’t always straightforward. DevOps engineers and system administrators frequently face the challenge of determining whether adding more workers will solve their performance issues or only increase operational cost without addressing the root cause.&lt;/p&gt; 
&lt;p&gt;This post explores different patterns for worker scaling decisions in Amazon MWAA, focusing on the task pool mechanism and its relationship to worker allocation. By examining specific scenarios and providing a practical decision framework, this post helps you determine whether adding workers is the right solution for your performance challenges, and if so, how to implement this scaling effectively.&lt;/p&gt; 
&lt;h1&gt;Main patterns&lt;/h1&gt; 
&lt;p&gt;This section discusses the most common problems that raise the question of whether adding workers would improve the health of your environment.&lt;/p&gt; 
&lt;h2&gt;High CPU&lt;/h2&gt; 
&lt;p&gt;Airflow serves as a workflow management platform that coordinates and schedules tasks to be run on external processing services. It acts as a central orchestrator that can trigger and monitor tasks across various data processing systems like &lt;a href="https://aws.amazon.com/glue/" target="_blank" rel="noopener noreferrer"&gt;AWS Glue&lt;/a&gt;, &lt;a href="https://aws.amazon.com/batch/" target="_blank" rel="noopener noreferrer"&gt;AWS Batch&lt;/a&gt;, &lt;a href="https://aws.amazon.com/emr/" target="_blank" rel="noopener noreferrer"&gt;Amazon EMR&lt;/a&gt;, and other specialized data processing tools. Rather than processing data itself, Airflow’s strength lies in managing complex workflows and coordinating jobs between different systems and services.&lt;/p&gt; 
&lt;p&gt;In Analytics and Big Data environments, there is a prevalent misconception that saturated resources automatically warrant adding more capacity. However, for Amazon MWAA, understanding your workflow characteristics and optimization opportunities should precede scaling decisions.&lt;/p&gt; 
&lt;p&gt;As you scale up your workflows, resource utilization of the Airflow clusters naturally increases. When workers consistently operate at full capacity, it may seem intuitive to add additional compute resources. However, this approach often masks underlying inefficiencies rather than resolving them.&lt;/p&gt; 
&lt;p&gt;For example, if a single task consumes 100% of the available CPU on your Amazon MWAA worker, adding workers will not resolve the problem, because the task is neither optimized nor split into smaller parts. Increasing the minimum worker count will therefore not have the expected effect and will only increase operating costs.&lt;/p&gt; 
&lt;p&gt;When your Amazon MWAA workers are consistently running above 90% CPU or memory utilization, you’ve reached a critical decision point. Before taking action, it is essential to understand the root cause. You have three primary options:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Scale horizontally by adding additional workers to distribute the load.&lt;/li&gt; 
 &lt;li&gt;Scale vertically by upgrading to a larger environment class for more resources per worker.&lt;/li&gt; 
 &lt;li&gt;Optimize your DAGs and scheduling patterns to be more efficient and consume fewer resources.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;Each approach addresses different underlying issues, and choosing the right path depends on identifying whether you are facing a capacity constraint, resource-intensive task design, or workflow inefficiency. For guidance on optimization strategies, please refer to &lt;a href="https://docs.aws.amazon.com/mwaa/latest/userguide/best-practices-tuning.html" target="_blank" rel="noopener noreferrer"&gt;Performance tuning for Apache Airflow on Amazon MWAA&lt;/a&gt;.&lt;/p&gt; 
&lt;p&gt;To monitor &lt;code&gt;CPUUtilization&lt;/code&gt; and &lt;code&gt;MemoryUtilization&lt;/code&gt; on the workers, follow &lt;a href="https://docs.aws.amazon.com/mwaa/latest/userguide/accessing-metrics-cw-container-queue-db.html#accessing-metrics-cw-container-queue-db-console" target="_blank" rel="noopener noreferrer"&gt;Accessing metrics in the Amazon CloudWatch console&lt;/a&gt;, choose the corresponding metrics, and then complete the following steps:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Select a time window long enough to show usage patterns.&lt;/li&gt; 
 &lt;li&gt;Set the period to &lt;strong&gt;1 Minute&lt;/strong&gt;.&lt;/li&gt; 
 &lt;li&gt;Set the statistic to &lt;strong&gt;Maximum&lt;/strong&gt;.&lt;/li&gt; 
&lt;/ol&gt; 
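&lt;p&gt;The same settings can be written down as a request for the CloudWatch GetMetricStatistics API. The following Python sketch only builds the request parameters. The helper name &lt;code&gt;cpu_query&lt;/code&gt; is hypothetical, and the &lt;code&gt;Environment&lt;/code&gt; and &lt;code&gt;Cluster&lt;/code&gt; dimension names are assumptions to verify against the metrics listed in your account. The default window is 24 hours because a single GetMetricStatistics call returns at most 1,440 data points, which is exactly one day at a 1-minute period.&lt;/p&gt; 
&lt;div class="hide-language"&gt; 
 &lt;pre&gt;&lt;code class="lang-python"&gt;from datetime import datetime, timedelta, timezone


def cpu_query(environment, cluster="BaseWorker", hours=24):
    """Build keyword arguments in the shape of CloudWatch GetMetricStatistics."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/MWAA",
        "MetricName": "CPUUtilization",
        "Dimensions": [  # assumption: confirm dimension names in your account
            {"Name": "Environment", "Value": environment},
            {"Name": "Cluster", "Value": cluster},
        ],
        "StartTime": end - timedelta(hours=hours),
        "EndTime": end,
        "Period": 60,  # 1 minute
        "Statistics": ["Maximum"],
    }


query = cpu_query("prod-airflow")
print(query["MetricName"], query["Period"], query["Statistics"])&lt;/code&gt;&lt;/pre&gt; 
&lt;/div&gt; 
&lt;p&gt;The dictionary could be passed to an SDK call such as &lt;code&gt;boto3.client("cloudwatch").get_metric_statistics(**query)&lt;/code&gt;. Changing &lt;code&gt;MetricName&lt;/code&gt; to &lt;code&gt;MemoryUtilization&lt;/code&gt; gives the matching memory view.&lt;/p&gt; 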
&lt;h2&gt;Long queue time&lt;/h2&gt; 
&lt;p&gt;Sometimes Airflow tasks are stuck in a queued state for a long time, which prevents DAGs from completing on time.&lt;/p&gt; 
&lt;p&gt;In Amazon MWAA, each environment class comes with a configured minimum and maximum number of worker nodes. Each worker provides a pre-configured concurrency, which is the number of tasks that can run simultaneously on that worker. The behavior is controlled through &lt;code&gt;celery.worker_autoscale=(max,min)&lt;/code&gt;.&lt;/p&gt; 
&lt;p&gt;For example, if you have a minimum of 4 mw1.small workers with the default Airflow configuration, you can run 20 concurrent tasks (4 workers × 5 max_tasks_per_worker). If your system suddenly needs to run more than 20 tasks concurrently, this results in an autoscaling event. Amazon MWAA decides how to scale your workers efficiently and triggers the process. Autoscaling, however, needs additional time to provision new workers, so tasks remain in the queued state in the meantime. To mitigate this queuing issue, consider the following:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;If the CPU utilization on the workers is low, increasing the &lt;code&gt;max&lt;/code&gt; value in &lt;code&gt;celery.worker_autoscale=(max,min)&lt;/code&gt; can reduce the time tasks stay in the queued state, because each worker can process more tasks concurrently. Note that an Airflow worker accepts tasks up to the defined task concurrency regardless of its available system resources. As a result, the base workers may reach 100% CPU or memory utilization before autoscaling takes effect.&lt;/li&gt; 
 &lt;li&gt;If you do not want to increase the task concurrency on the workers, increasing the minimum worker count can also be beneficial because having more available workers allows a higher number of tasks to run concurrently.&lt;/li&gt; 
&lt;/ol&gt; 
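&lt;p&gt;The capacity arithmetic behind both options can be sketched in a few lines of Python. This is a minimal illustration, not an Amazon MWAA API: the function names are hypothetical, and the values of 10 tasks per worker and 6 workers are assumptions chosen only to show how each option changes the ceiling.&lt;/p&gt; 

```python
def max_concurrent_tasks(worker_count: int, tasks_per_worker: int) -> int:
    """Upper bound on tasks that can run at once across all workers."""
    return worker_count * tasks_per_worker


def exceeds_capacity(demand: int, worker_count: int, tasks_per_worker: int) -> bool:
    """True when demand is higher than what the current workers can run at once."""
    return demand > max_concurrent_tasks(worker_count, tasks_per_worker)


# Baseline from the example above: 4 mw1.small workers, 5 tasks each.
baseline = max_concurrent_tasks(4, 5)

# Option 1: raise per-worker concurrency (assumed value of 10).
more_concurrency = max_concurrent_tasks(4, 10)

# Option 2: raise the minimum worker count (assumed value of 6).
more_workers = max_concurrent_tasks(6, 5)

print(baseline, more_concurrency, more_workers)
```

&lt;p&gt;Option 1 raises the ceiling without adding workers, but it concentrates more load on each worker, so check CPU and memory utilization before choosing it.&lt;/p&gt; 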
&lt;h2&gt;Scheduling delays&lt;/h2&gt; 
&lt;p&gt;Adding new DAGs can not only affect your system resources, but it can also create uneven scheduling patterns. Some DAGs may experience delayed execution because of resource competition, even when the overall environment metrics appear healthy. This scheduling skew often manifests as inconsistent task pickup times, where certain workflows consistently wait longer in the queue while others execute promptly.&lt;/p&gt; 
&lt;p&gt;When &lt;a href="https://docs.aws.amazon.com/mwaa/latest/userguide/accessing-metrics-cw-container-queue-db.html" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch metrics&lt;/a&gt; show increasing variance in task scheduling times, particularly during periods of high DAG activity, it signals the need for environment optimization. This scenario requires careful analysis of execution patterns and resource utilization to determine whether:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Adding workers would distribute the workload. This solution is most effective when the high utilization is primarily because of task execution load rather than DAG parsing or scheduling overhead. Raising the minimum worker count allows you to execute more tasks in parallel. For example, if you observe the value of &lt;code&gt;AWS/MWAA/ApproximateAgeOfOldestTask&lt;/code&gt; steadily increasing, the workers are not consuming messages from the queue fast enough. You can also monitor &lt;code&gt;AWS/MWAA/QueuedTasks&lt;/code&gt; to identify similar patterns.&lt;/li&gt; 
 &lt;li&gt;Upgrading the environment class would provide better scheduling capacity. If the Scheduler is showing signs of strain or if you’re seeing high resource utilization across all components, upgrading to a larger environment class might be the most appropriate solution. This provides more resources to both the Scheduler and Workers, allowing for better handling of increased DAG complexity and volume. To validate this, use &lt;code&gt;AWS/MWAA/CPUUtilization&lt;/code&gt; and &lt;code&gt;AWS/MWAA/MemoryUtilization&lt;/code&gt; in the Cluster metrics and choose the &lt;code&gt;Scheduler&lt;/code&gt;, &lt;code&gt;BaseWorker&lt;/code&gt;, and &lt;code&gt;AdditionalWorker&lt;/code&gt; metrics.&lt;/li&gt; 
 &lt;li&gt;Restructuring DAG schedules would reduce resource contention.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;The key is to understand your workflow patterns and identify whether the scheduling delays are because of insufficient worker capacity or other environmental constraints.&lt;/p&gt; 
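&lt;p&gt;One way to make the “steadily increasing” check concrete is to test a series of metric samples for a strictly rising trend. The following sketch assumes you have already exported datapoints (for example, values of &lt;code&gt;ApproximateAgeOfOldestTask&lt;/code&gt;) into a list. The function name, the minimum of five samples, and the sample values are illustrative assumptions, not part of any AWS API.&lt;/p&gt; 

```python
def is_steadily_increasing(samples: list[float], min_points: int = 5) -> bool:
    """Return True when each sample is larger than the one before it.

    Requires at least min_points samples so that a short blip is not
    mistaken for a sustained trend.
    """
    if min_points > len(samples):
        return False
    return all(later > earlier for earlier, later in zip(samples, samples[1:]))


# Hypothetical one-minute samples of the oldest queued task's age, in seconds.
backlog_growing = [12.0, 30.0, 55.0, 90.0, 140.0]
backlog_stable = [12.0, 30.0, 18.0, 25.0, 20.0]

print(is_steadily_increasing(backlog_growing))
print(is_steadily_increasing(backlog_stable))
```

&lt;p&gt;A strict comparison is deliberately simple. In practice you might prefer a tolerance or a moving average so that a single noisy datapoint does not hide a real trend.&lt;/p&gt; 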
&lt;h1&gt;Anti-patterns&lt;/h1&gt; 
&lt;p&gt;This section covers the most common anti-patterns that lead Amazon MWAA users to believe that adding more workers will improve performance.&lt;/p&gt; 
&lt;h2&gt;Underutilized workers&lt;/h2&gt; 
&lt;p&gt;When evaluating Amazon MWAA performance bottlenecks, it’s important to distinguish between resource constraints and DAG design inefficiencies before scaling the environment.&lt;/p&gt; 
&lt;p&gt;Sometimes the Amazon MWAA environment has the capacity to run 100 tasks concurrently but your queue metrics (&lt;code&gt;AWS/MWAA/RunningTasks&lt;/code&gt;) show only 20 tasks active most of the time with no tasks remaining in queued state. In such scenarios, you are advised to check &lt;a href="https://docs.aws.amazon.com/mwaa/latest/userguide/accessing-metrics-cw-container-queue-db.html#accessing-metrics-cw-container-queue-db-list" target="_blank" rel="noopener noreferrer"&gt;Amazon CloudWatch&lt;/a&gt; for consistently low CPU and memory usage on existing workers during peak workload times. If this is confirmed, it is usually an indication of inefficiencies in DAG design, scheduling patterns, or Airflow configuration.&lt;/p&gt; 
&lt;p&gt;You have two primary options to address this:&lt;/p&gt; 
&lt;p&gt;1. &lt;strong&gt;Downsize&lt;/strong&gt;: If you do not expect your workload to increase, it is safe to assume you have over-provisioned your cluster. Start by removing any extra workers, and downsize your environment class only as a last step.&lt;/p&gt; 
&lt;p&gt;2. &lt;strong&gt;Optimize&lt;/strong&gt;: Fine-tune your DAG scheduling and your Airflow concurrency settings, including pools, to increase the throughput of your system.&lt;/p&gt; 
&lt;h2&gt;Misconfigured Airflow configurations that create artificial bottlenecks&lt;/h2&gt; 
&lt;p&gt;In Apache Airflow, performance bottlenecks often occur because of configuration settings, not actual resource constraints. At such times, DAG executions get delayed not because of insufficient compute, but because of incorrect concurrency configuration.&lt;/p&gt; 
&lt;p&gt;Efficient use of Amazon MWAA requires reviewing not only resource utilization for Workers and Schedulers but also concurrency configurations for artificially created bottlenecks. Sometimes one restrictive configuration cancels out the scaling benefits of a larger environment or additional workers. Always audit Airflow configurations if performance seems limited even when system metrics suggest spare capacity.&lt;/p&gt; 
&lt;p&gt;&lt;em&gt;&lt;strong&gt;Important consideration&lt;/strong&gt;: Amazon Managed Workflows for Apache Airflow (Amazon MWAA) does not automatically update the worker concurrency configuration when you change the environment class. This behavior is important to understand when scaling your environment. For example, suppose you initially create an mw1.small environment, where each worker can handle up to 5 concurrent tasks by default. When you upgrade to a medium environment class (which supports 10 concurrent tasks per worker by default), the concurrency setting &lt;strong&gt;remains at 5&lt;/strong&gt; for in-place updated environments. You must manually update the concurrency configuration to take full advantage of the increased capacity available in the medium environment class.&lt;/em&gt;&lt;/p&gt; 
&lt;p&gt;For this reason, you also need to update the Airflow configurations that control concurrency whenever you update the environment class. To update the concurrency setting after upgrading your environment class, modify the &lt;code&gt;celery.worker_autoscale&lt;/code&gt; configuration in your Apache Airflow configuration options. This makes sure your workers can process the maximum number of concurrent tasks supported by your new environment class.&lt;/p&gt; 
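&lt;p&gt;As a rough sketch of what that update could look like, the following Python builds the arguments for the boto3 &lt;code&gt;update_environment&lt;/code&gt; call on the MWAA client. The environment name &lt;code&gt;my-mwaa-env&lt;/code&gt; and the helper function are hypothetical, and the values of 10 and 10 are assumptions based on the medium-class default described above. The actual API call is left commented out because it requires boto3 and AWS credentials.&lt;/p&gt; 

```python
def concurrency_update(environment_name: str, max_tasks: int, min_tasks: int) -> dict:
    """Build keyword arguments for the boto3 MWAA update_environment call.

    The option value uses the max,min order of celery.worker_autoscale.
    """
    return {
        "Name": environment_name,
        "AirflowConfigurationOptions": {
            "celery.worker_autoscale": f"{max_tasks},{min_tasks}",
        },
    }


# Hypothetical environment name; 10,10 assumes the medium-class default.
params = concurrency_update("my-mwaa-env", 10, 10)
print(params)

# To apply the change (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("mwaa").update_environment(**params)
```

&lt;p&gt;This sketch sends only one configuration key. Review your environment’s existing Airflow configuration options first and include any that you want to keep.&lt;/p&gt; 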
&lt;p&gt;Other times, an Amazon MWAA environment can be constrained by &lt;code&gt;max_active_runs&lt;/code&gt; or DAG concurrency controls instead of actual resource limits. These configuration-based throttles prevent tasks from running, even when the worker instances have available compute to handle the workload.&lt;/p&gt; 
&lt;p&gt;There is an important distinction between the two. Configuration limits act as artificial caps on parallelism, while true resource limits indicate that workers are fully utilizing their CPU or memory capacity. Understanding which type of constraint affects your environment helps you determine whether to adjust configuration settings or scale your infrastructure.&lt;/p&gt; 
&lt;p&gt;Adjusting Airflow configurations such as pools, concurrency, and &lt;code&gt;max_active_runs&lt;/code&gt; can often solve performance problems without scaling workers. The following are some of the configurations you can use to control this behavior:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;max_active_runs&lt;/strong&gt; (DAG level; the environment-level default is &lt;code&gt;max_active_runs_per_dag&lt;/code&gt;): Controls how many runs of a given DAG are allowed at the same time. If set to 2, only 2 DAG runs can run concurrently, even if there is plenty of worker capacity left. Extra runs queue, slowing DAG execution even though workers are idle.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;max_active_tasks:&lt;/strong&gt; This parameter in a DAG definition (formerly named &lt;code&gt;concurrency&lt;/code&gt;), or the equivalent setting at the environment level, limits the number of tasks from the DAG running at any moment, regardless of overall system capacity or number of workers.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Pools:&lt;/strong&gt; Pools restrict how many tasks of a certain type (often resource heavy) can run at once. A pool with only 3 slots will throttle any tasks above 3 assigned to that pool, leaving workers idle.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Execution timeouts and retries:&lt;/strong&gt; If these are not tuned, failed tasks might fill up slots unnecessarily, and stuck tasks can block worker slots and slow queue processing.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Scheduling intervals and dependencies:&lt;/strong&gt; Overlapping or inefficient scheduling may cause idle periods or excess contention for resources, affecting real throughput.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;strong&gt;How Airflow configurations can override each other&lt;/strong&gt;&lt;/p&gt; 
&lt;p&gt;Airflow has multiple layers of concurrency and scheduling controls: some at the environment level, some at the DAG or task level, and others for pools. Sometimes more restrictive settings override more permissive ones, resulting in unexpected queue buildup.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;DAG level vs Environment level:&lt;/strong&gt; If &lt;code&gt;max_active_runs&lt;/code&gt; (DAG level) is lower than the environment-level &lt;code&gt;max_active_runs_per_dag&lt;/code&gt; or system-wide concurrency, the DAG setting is used, throttling tasks even if the environment could do more.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Task level overrides:&lt;/strong&gt; Individual task definitions can have their own parameters, such as &lt;code&gt;max_active_tis_per_dag&lt;/code&gt;, which caps the concurrent instances of that task and creates a bottleneck if set lower than global settings.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Order of precedence:&lt;/strong&gt; The most restrictive relevant configuration at any level (Environment, DAG, Task) effectively sets the upper bound for parallel task execution.&lt;/p&gt; 
&lt;table border="1px" cellpadding="10px"&gt; 
 &lt;tbody&gt; 
  &lt;tr&gt; 
   &lt;td&gt;&lt;strong&gt;Setting Location&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Setting&lt;/strong&gt;&lt;/td&gt; 
   &lt;td&gt;&lt;strong&gt;Effect on task throughput&lt;/strong&gt;&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;Environment Level&lt;/td&gt; 
   &lt;td&gt;parallelism&lt;/td&gt; 
   &lt;td&gt;Max total tasks running on Scheduler&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;DAG Level&lt;/td&gt; 
   &lt;td&gt;max_active_runs&lt;/td&gt; 
   &lt;td&gt;Max simultaneous DAG runs&lt;/td&gt; 
  &lt;/tr&gt; 
  &lt;tr&gt; 
   &lt;td&gt;DAG Level&lt;/td&gt; 
   &lt;td&gt;max_active_tasks (formerly concurrency)&lt;/td&gt; 
   &lt;td&gt;Max concurrent tasks for that DAG&lt;/td&gt; 
  &lt;/tr&gt; 
 &lt;/tbody&gt; 
&lt;/table&gt; 
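&lt;p&gt;The “most restrictive setting wins” rule can be illustrated with a short Python sketch. It models a simplified case: a single DAG whose tasks all use one pool. The function name and every number are hypothetical values chosen for illustration, and real scheduling involves more factors than this minimum.&lt;/p&gt; 

```python
def binding_limit(limits: dict[str, int]) -> tuple[str, int]:
    """Return the name and value of the most restrictive limit."""
    name = min(limits, key=limits.get)
    return name, limits[name]


# Hypothetical values for one DAG whose tasks all use the same pool.
limits = {
    "worker_slots": 100,     # for example, 10 workers x 10 tasks per worker
    "parallelism": 64,       # environment-level cap
    "max_active_tasks": 16,  # DAG-level cap
    "pool_slots": 3,         # slots in the pool the tasks are assigned to
}

name, value = binding_limit(limits)
print(f"Effective parallel tasks: {value} (bound by {name})")
```

&lt;p&gt;In this example the environment could run 100 tasks, yet the pool allows only 3. Adding workers would raise &lt;code&gt;worker_slots&lt;/code&gt; without changing the result.&lt;/p&gt; 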
&lt;p&gt;Performance issues often resemble resource exhaustion, but actually derive from overly restrictive configurations. Audit all the preceding parameters carefully. You can loosen restrictive values step by step and monitor their effect before deciding to scale your cluster further. This approach ensures optimal and cost-efficient usage of your cloud resources without paying for idle capacity.&lt;/p&gt; 
&lt;h2&gt;Slow resource depletion from memory leaks&lt;/h2&gt; 
&lt;p&gt;A common sign of a memory leak or slow resource depletion in Amazon MWAA is DAGs and tasks that begin to fail or slow down over time. Scaling workers or increasing environment size does not resolve the underlying issue. This happens because the root cause is not a lack of capacity but rather an application-level leak that causes persistent exhaustion.&lt;/p&gt; 
&lt;p&gt;For example, as Airflow continuously runs tasks and parses DAGs over time, memory consumption can steadily increase across the environment. This might manifest as an Amazon MWAA metadata database experiencing declining FreeableMemory metrics despite consistent or even reduced workloads. When this occurs, database query performance gradually declines as memory becomes constrained for the scheduler, workers, and metadata database, ultimately affecting overall environment responsiveness because Airflow depends heavily on its metadata database for critical operations. This scenario is similar to how an application might create database connections without properly closing them, leading to resource exhaustion over time.&lt;/p&gt; 
&lt;h3&gt;Graph: Declining FreeableMemory and MemoryUtilization&lt;/h3&gt; 
&lt;p&gt;&lt;img src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/30/graph-freeablememory-memoryutilization-2026-04-30.png"&gt;&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;Common causes:&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Connection pool exhaustion:&lt;/strong&gt; DAGs that fail to properly close database connections can lead to connection pool exhaustion and memory leaks in the database.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Resource-intensive operations:&lt;/strong&gt; Complex, long-running queries or XCOM operations against the metadata database can consume excessive memory.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Inefficient DAG design:&lt;/strong&gt; DAGs with numerous top-level Python calls can trigger database queries during DAG parsing. For instance, calling &lt;code&gt;Variable.get()&lt;/code&gt; at the DAG level rather than at the task level creates unnecessary database load.&lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;&lt;strong&gt;Recommended solutions:&lt;/strong&gt;&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;&lt;strong&gt;Implement Amazon CloudWatch monitoring:&lt;/strong&gt; Establish Amazon CloudWatch alarms for FreeableMemory with appropriate thresholds to detect issues early.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Regular database maintenance:&lt;/strong&gt; Perform scheduled database clean-up operations to purge historical data that is no longer needed.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Optimize DAG code:&lt;/strong&gt; Refactor DAGs to move database operations like &lt;code&gt;Variable.get()&lt;/code&gt; from the DAG level to the task level to reduce parsing overhead.&lt;/li&gt; 
 &lt;li&gt;&lt;strong&gt;Connection management:&lt;/strong&gt; Make sure all database connections are properly closed after use to prevent connection pool exhaustion.&lt;/li&gt; 
&lt;/ol&gt; 
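&lt;p&gt;For the connection management recommendation, one common Python pattern is to wrap connections and cursors in &lt;code&gt;contextlib.closing&lt;/code&gt; so they are released even when a query raises an exception. The sketch below uses the standard library &lt;code&gt;sqlite3&lt;/code&gt; module as a stand-in for whatever database your tasks connect to. The function name and the &lt;code&gt;events&lt;/code&gt; table are hypothetical.&lt;/p&gt; 

```python
import sqlite3
from contextlib import closing


def record_event_and_count(db_path: str) -> int:
    """Insert one row, then return the row count.

    closing() guarantees that the cursor and the connection are closed
    when the block exits, including when an exception is raised.
    """
    with closing(sqlite3.connect(db_path)) as conn:
        with closing(conn.cursor()) as cur:
            cur.execute("CREATE TABLE IF NOT EXISTS events (id INTEGER)")
            cur.execute("INSERT INTO events VALUES (1)")
            conn.commit()
            cur.execute("SELECT COUNT(*) FROM events")
            return cur.fetchone()[0]


# An in-memory database keeps the example self-contained.
print(record_event_and_count(":memory:"))
```

&lt;p&gt;The same structure applies inside a task callable: open the connection within the task, and let the context manager close it before the task returns.&lt;/p&gt; 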
&lt;p&gt;By following the preceding recommendations, you can keep memory utilization for the metadata database healthy and maintain optimal performance of your Amazon MWAA environment without needing to scale workers.&lt;/p&gt; 
&lt;h1&gt;Conclusion&lt;/h1&gt; 
&lt;p&gt;The decision to add workers in Amazon MWAA environments requires careful consideration of multiple factors beyond simple task queue metrics. In this post, we showed that while adding workers can address certain performance challenges, it’s often not the optimal first response to system bottlenecks.&lt;/p&gt; 
&lt;p&gt;Key considerations before scaling workers include:&lt;/p&gt; 
&lt;ol&gt; 
 &lt;li&gt;Root cause analysis 
  &lt;ul&gt; 
   &lt;li&gt;Verify whether high CPU/memory usage stems from task optimization issues.&lt;/li&gt; 
   &lt;li&gt;Examine if queuing problems result from configuration constraints rather than resource limitations.&lt;/li&gt; 
   &lt;li&gt;Investigate potential memory leaks or resource depletion patterns.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
 &lt;li&gt;Configuration optimization 
  &lt;ul&gt; 
   &lt;li&gt;Review and adjust Airflow parameters (concurrency settings, pools, timeouts).&lt;/li&gt; 
   &lt;li&gt;Understand the interaction between different configuration layers.&lt;/li&gt; 
   &lt;li&gt;Optimize DAG design and scheduling patterns.&lt;/li&gt; 
  &lt;/ul&gt; &lt;/li&gt; 
&lt;/ol&gt; 
&lt;p&gt;The most successful Amazon MWAA implementations follow a systematic approach: first optimizing existing resources and configurations, then scaling workers only when justified by data-driven capacity planning. This approach ensures cost-effective operations while maintaining reliable workflow performance.&lt;/p&gt; 
&lt;p&gt;Remember that worker scaling is only one tool in the Amazon MWAA optimization toolkit. Long-term success depends on building a comprehensive performance management strategy that combines proper monitoring, proactive capacity planning, and continuous optimization of your Airflow workflows.&lt;/p&gt; 
&lt;p&gt;In the next post, we discuss capacity planning and the steps you need to perform before adding additional DAGs in your environment so that you can plan for the additional load and make sure you have enough headroom.&lt;/p&gt; 
&lt;p&gt;To get started, visit the &lt;a href="https://aws.amazon.com/managed-workflows-for-apache-airflow/" target="_blank" rel="noopener noreferrer"&gt;Amazon MWAA product page&lt;/a&gt; and the &lt;a href="https://docs.aws.amazon.com/mwaa/latest/userguide/best-practices-tuning.html" target="_blank" rel="noopener noreferrer"&gt;Performance tuning for Apache Airflow on Amazon MWAA&lt;/a&gt; page.&lt;/p&gt; 
&lt;p&gt;If you have questions or want to share your MWAA scaling experiences, leave a comment below.&lt;/p&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-2.jpeg" alt="Boyko Radulov" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Boyko Radulov&lt;/h3&gt; 
  &lt;p&gt;Boyko is a Senior Cloud Support Engineer at Amazon Web Services (AWS), Amazon MWAA and AWS Glue Subject Matter Expert. He works closely with customers to build and optimize their workloads on AWS while reducing the overall cost. Beyond work, he is passionate about sports and travelling.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-3.jpeg" alt="Kamen Sharlandjiev" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Kamen Sharlandjiev&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/ksharlandjiev/" target="_blank" rel="noopener"&gt;Kamen&lt;/a&gt; is a Principal Big Data and ETL Solutions Architect, Amazon MWAA and AWS Glue ETL expert. He’s on a mission to make life easier for customers who are facing complex data integration and orchestration challenges. His secret weapon? Fully managed AWS services that can get the job done with minimal effort. Follow Kamen on &lt;a href="https://www.linkedin.com/in/ksharlandjiev/" target="_blank" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; to keep up to date with the latest Amazon MWAA and AWS Glue features and news.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-4.jpeg" alt="Venu Thangalapally" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Venu Thangalapally&lt;/h3&gt; 
  &lt;p&gt;Venu is a Senior Solutions Architect at AWS, based in Chicago, with deep expertise in cloud architecture, data and analytics, containers, and application modernization. He partners with financial service industry customers to translate business goals into secure, scalable, and compliant cloud solutions that deliver measurable value. Venu is passionate about using technology to drive innovation and operational excellence.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-5.jpeg" alt="Harshawardhan Kulkarni" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Harshawardhan Kulkarni&lt;/h3&gt; 
  &lt;p&gt;Harshawardhan is a Partner Technical Account Manager at AWS, Amazon MWAA Subject Matter Expert. Based in Dublin Ireland, he partners with Enterprise Customers across EMEA to help navigate complex workflows and orchestration challenges while ensuring best practice implementation. Outside of work, he enjoys traveling and spending time with his family.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="aligncenter size-full wp-image-29797" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/29/BDB-4941-6.jpeg" alt="Andrew McKenzie" width="120" height="160"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Andrew McKenzie&lt;/h3&gt; 
  &lt;p&gt;Andrew is a Data Engineer and Educator who uses deep technical expertise from his time at AWS. As a former Amazon MWAA Subject Matter Expert, he now focuses on building data solutions and teaching data engineering best practices.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		
			</item>
		<item>
		<title>Unified observability in Amazon OpenSearch Service: metrics, traces, and AI agent debugging in a single interface</title>
		<link>https://aws.amazon.com/blogs/big-data/unified-observability-in-amazon-opensearch-service-metrics-traces-and-ai-agent-debugging-in-a-single-interface/</link>
					
		
		<dc:creator><![CDATA[Muthu Pitchaimani]]></dc:creator>
		<pubDate>Tue, 28 Apr 2026 17:29:01 +0000</pubDate>
				<category><![CDATA[Amazon OpenSearch Service]]></category>
		<category><![CDATA[Analytics]]></category>
		<category><![CDATA[Launch]]></category>
		<guid isPermaLink="false">1402dbf1fe29f4a55e798bc7d879de3064ae09bf</guid>

					<description>Amazon OpenSearch Service now brings application monitoring, native Amazon Managed Service for Prometheus integration, and AI agent tracing together in OpenSearch UI's observability workspace. In this post, we walk through two real-world scenarios using the OpenTelemetry sample app: a multi-agent travel planner facing slow processing, and a checkout flow quietly failing on one microservice.</description>
										<content:encoded>&lt;p&gt;&lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/what-is.html" target="_blank" rel="noopener noreferrer"&gt;Amazon OpenSearch Service&lt;/a&gt; now brings application monitoring, native &lt;a href="https://docs.aws.amazon.com/prometheus/latest/userguide/what-is-Amazon-Managed-Service-Prometheus.html" target="_blank" rel="noopener noreferrer"&gt;Amazon Managed Service for Prometheus&lt;/a&gt; integration, and AI agent tracing together in &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/application.html" target="_blank" rel="noopener noreferrer"&gt;OpenSearch UI&lt;/a&gt;’s observability workspace. You can query Prometheus metrics with &lt;a href="https://prometheus.io/docs/prometheus/latest/querying/basics/" target="_blank" rel="noopener noreferrer"&gt;PromQL&lt;/a&gt; alongside logs and traces stored in Amazon OpenSearch Service, trace an AI agent’s full reasoning chain down to the failing tool call, and drill from a service-level health view to the exact span that caused a checkout failure, all without leaving the interface.&lt;/p&gt; 
&lt;p&gt;In this post, we walk through two real-world scenarios using the OpenTelemetry sample app: a multi-agent travel planner facing slow processing, and a checkout flow quietly failing on one microservice. We chase each one to its root cause using these new capabilities.&lt;/p&gt; 
&lt;h2&gt;Scenario 1: An underperforming AI agent&lt;/h2&gt; 
&lt;p&gt;Your multi-agent travel planner is live and users start reporting slow responses. With the new AI agent tracing capability in Amazon OpenSearch Service, you can trace the agent’s full processing path to pinpoint exactly where things went wrong.&lt;/p&gt; 
&lt;p&gt;In any observability workspace in OpenSearch UI, navigate to &lt;strong&gt;Application Map&lt;/strong&gt; in the left navigation pane.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90438" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image003.jpg" alt="OpenSearch Service application map" width="2258" height="1520"&gt;&lt;/p&gt; 
&lt;p&gt;You can see the full topology of your system including the travel agent and the sub-agents it calls. The travel agent node shows elevated latency and occasional errors. Select it, and the side panel confirms that latency is up but the latency chart shows intermittent spikes rather than consistent degradation.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90439" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image005-scaled.jpg" alt="System topology with service health metrics" width="2560" height="1302"&gt;&lt;/p&gt; 
&lt;p&gt;The application map tells you something is wrong, but understanding &lt;em&gt;why&lt;/em&gt; an AI agent is underperforming requires seeing its reasoning chain. Select &lt;strong&gt;Agent Traces&lt;/strong&gt; in the left navigation pane, then filter by service name and time range.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90440" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image007.png" alt="Agent processing steps with invocation data" width="1430" height="728"&gt;&lt;/p&gt; 
&lt;p&gt;Select one of the traces to see the trace tree. Unlike a traditional span waterfall, this view organizes around the agent’s reasoning chain: the root agent span, the LLM calls it made, the tools it invoked, and how they nested, with each step color-coded by type. The trace map provides a visual directed graph of the same execution. You can see which model was called, how many input and output tokens were consumed, and the actual messages sent to and received from the model.&lt;/p&gt; 
&lt;p&gt;A tool call inside the weather agent errored out. The agent then spent additional time reasoning about the failure before returning a partial response, which explains the intermittent latency spikes and occasional faults.&lt;/p&gt; 
&lt;h3&gt;Why this matters for AI agents&lt;/h3&gt; 
&lt;p&gt;Agents make autonomous decisions based on LLM responses, tool results, and chained reasoning. Unlike traditional microservices with deterministic code paths, agent behavior varies across executions. Without semantic tracing that captures these AI-specific signals, root-cause analysis is guesswork. The trace tree surfaced the model name, token counts, and failing tool call because the travel planner was instrumented with OpenTelemetry’s generative AI semantic conventions. The next section describes how.&lt;/p&gt; 
&lt;h3&gt;Instrumenting AI agents&lt;/h3&gt; 
&lt;p&gt;OpenTelemetry auto-instrumentation enriches spans with well-known attributes for HTTP, database, and gRPC calls. AI agents need a different set of attributes that standard instrumentation doesn’t cover, such as which LLM was called, what tokens were consumed, and which tools were invoked.&lt;/p&gt; 
&lt;p&gt;The &lt;a href="https://opentelemetry.io/docs/specs/semconv/gen-ai/" target="_blank" rel="noopener"&gt;OpenTelemetry gen_ai semantic conventions&lt;/a&gt; define standard attributes for these signals, including &lt;code&gt;gen_ai.operation.name&lt;/code&gt;, &lt;code&gt;gen_ai.usage.input_tokens&lt;/code&gt;, &lt;code&gt;gen_ai.request.model&lt;/code&gt;, and &lt;code&gt;gen_ai.tool.name&lt;/code&gt;. When Amazon OpenSearch Service receives spans with these attributes, it categorizes them by operation type (agent, LLM, tool, embeddings, retrieval) and renders the agent trace tree and trace map views.&lt;/p&gt; 
&lt;p&gt;The Python SDK provides one way to generate these spans. To send traces to Amazon OpenSearch Ingestion, configure the SDK with AWS Signature Version 4 (SigV4) authentication. The &lt;code&gt;AWSSigV4OTLPExporter&lt;/code&gt; cryptographically signs each HTTP request to help prevent unauthorized data ingestion. The calling identity needs an IAM policy that grants &lt;code&gt;osis:Ingest&lt;/code&gt; on your pipeline’s ARN. Credentials are resolved through the standard AWS credential provider chain.&lt;/p&gt; 
&lt;pre&gt;&lt;code class="language-python"&gt;from opensearch_genai_observability_sdk_py import register, AWSSigV4OTLPExporter

exporter = AWSSigV4OTLPExporter(
    endpoint="https://pipeline.us-east-1.osis.amazonaws.com/v1/traces",
    service="osis",
    region="us-east-1",
)

register(service_name="my-agent", exporter=exporter)
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;Use the &lt;code&gt;@observe&lt;/code&gt; decorator to trace agent functions and &lt;code&gt;enrich()&lt;/code&gt; to add model metadata:&lt;/p&gt; 
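&lt;pre&gt;&lt;code class="language-python"&gt;# Assumes these names are exported by the same package as register()
from opensearch_genai_observability_sdk_py import Op, enrich, observe

@observe(op=Op.EXECUTE_TOOL)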
def get_weather(city: str) -&amp;gt; dict:
    return {"city": city, "temp": 22, "condition": "sunny"}

@observe(op=Op.INVOKE_AGENT)
def assistant(query: str) -&amp;gt; str:
    enrich(model="gpt-4o", provider="openai")
    data = get_weather("Paris")
    return f"{data['condition']}, {data['temp']}C"

result = assistant("What's the weather?")
&lt;/code&gt;&lt;/pre&gt; 
&lt;p&gt;The SDK also supports auto-instrumentation for OpenAI, Anthropic, Amazon Bedrock, LangChain, LlamaIndex, and others. Because the instrumentation is built on OpenTelemetry standards, any agent framework that emits spans with &lt;code&gt;gen_ai.*&lt;/code&gt; attributes is compatible with OpenSearch UI.&lt;/p&gt; 
&lt;h2&gt;Scenario 2: Investigating a microservice issue&lt;/h2&gt; 
&lt;p&gt;AI agents are only one part of most production environments. The same interface surfaces telemetry from conventional microservices, where the troubleshooting workflow follows a more familiar path.&lt;/p&gt; 
&lt;p&gt;Your ecommerce checkout begins paging during a busy traffic window. From OpenSearch UI, navigate to &lt;strong&gt;APM Services&lt;/strong&gt; in the left navigation pane. Every instrumented service is listed alongside its health indicators. The checkout service shows an elevated error rate.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90441" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image009-scaled.jpg" alt="Service overview panel with request, error, duration metrics" width="2560" height="1306"&gt;&lt;/p&gt; 
&lt;p&gt;Select the affected service. The detail view shows Request, Error, and Duration (RED) metrics: request rate is climbing, fault rate has spiked in the last 15 minutes, and p99 duration has doubled. You can see exactly when the degradation started.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90442" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image011.png" alt="Service drilldown health dashboard" width="1431" height="723"&gt;&lt;/p&gt; 
&lt;p&gt;Drill into the correlated spans for the affected time window. The span list shows multiple failed requests, all hitting the same endpoint. Select one to see the full trace waterfall. The checkout service called &lt;code&gt;prepareOrder&lt;/code&gt;, which failed trying to retrieve a product from the catalog. The error message in the span details tells you exactly what went wrong; that’s your root cause.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone wp-image-90443 size-full" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image013.png" alt="Waterfall transaction view of spans" width="1429" height="730"&gt;&lt;/p&gt; 
&lt;h3&gt;Checking the infrastructure with PromQL&lt;/h3&gt; 
&lt;p&gt;In both scenarios, the natural next question is whether the problem originates in the application or in the infrastructure beneath it. With the new Amazon Managed Service for Prometheus integration, you can answer that question without leaving OpenSearch UI.&lt;/p&gt; 
&lt;p&gt;Prometheus metrics are now queryable directly from the same workspace using native PromQL syntax, alongside the logs and traces you’ve already been navigating.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90444" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image015.png" alt="Metric query showing Prometheus Query Language" width="1431" height="820"&gt;&lt;/p&gt; 
&lt;p&gt;For the database timeout in Scenario 2, run a PromQL query to check the database instance’s read/write throughput for the same time window. For the agent latency issue in Scenario 1, check the LLM endpoint’s response time metrics to see if the slowness originates from the model provider.&lt;/p&gt; 
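&lt;p&gt;As a rough illustration of what those two checks might look like, the following Python sketch assembles the PromQL strings you could paste into the query editor. The metric names, label names, and helper functions are hypothetical placeholders, not names defined by the integration. Substitute whatever your own instrumentation exports.&lt;/p&gt;

```python
# Hypothetical metric names; replace them with the series your
# exporters actually publish to Prometheus.
DB_OPS_METRIC = "db_operations_total"
LLM_LATENCY_METRIC = "llm_request_duration_seconds_bucket"


def throughput_query(metric: str, instance: str, window: str = "5m") -> str:
    """Per-second rate of a counter, grouped by operation (read or write)."""
    selector = f'{metric}{{instance="{instance}"}}'
    return f"sum by (operation) (rate({selector}[{window}]))"


def p99_query(bucket_metric: str, endpoint: str, window: str = "5m") -> str:
    """99th percentile latency estimated from histogram buckets."""
    selector = f'{bucket_metric}{{endpoint="{endpoint}"}}'
    return f"histogram_quantile(0.99, sum by (le) (rate({selector}[{window}])))"


if __name__ == "__main__":
    # Scenario 2: database read/write throughput (instance label is assumed).
    print(throughput_query(DB_OPS_METRIC, "orders-db"))
    # Scenario 1: LLM endpoint response time (endpoint label is assumed).
    print(p99_query(LLM_LATENCY_METRIC, "chat-model"))
```

&lt;p&gt;Both queries use standard PromQL functions (&lt;code&gt;rate&lt;/code&gt; and &lt;code&gt;histogram_quantile&lt;/code&gt;). Only the series and label names are assumptions.&lt;/p&gt; 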
&lt;p&gt;This is a key architectural decision: metrics continue to live in Amazon Managed Service for Prometheus, logs and traces continue to live in Amazon OpenSearch Service, and neither signal is copied or warehoused into a second store. Each backend remains the single store for the data type it’s purpose-built to handle, while OpenSearch UI federates queries across both at runtime. The cost, retention, and operational model of each store stay intact while the troubleshooting workflow collapses into a single interface.&lt;/p&gt; 
&lt;p&gt;To configure the OpenTelemetry Collector and OpenSearch Ingestion pipelines that route metrics into Amazon Managed Service for Prometheus, see &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/observability-ingestion.html" target="_blank" rel="noopener"&gt;Ingesting application telemetry&lt;/a&gt;.&lt;/p&gt; 
&lt;h2&gt;How it’s wired together&lt;/h2&gt; 
&lt;p&gt;The following diagram shows the end-to-end architecture. Applications instrumented with OpenTelemetry send traces, logs, and metrics over OTLP to Amazon OpenSearch Ingestion. OpenSearch Ingestion routes each signal to the appropriate store: traces and logs land in Amazon OpenSearch Service, while metrics flow into Amazon Managed Service for Prometheus. OpenSearch UI then queries both stores to render the Application Map, Services catalog, Agent Traces, and Metrics views.&lt;/p&gt; 
&lt;p&gt;&lt;img loading="lazy" class="alignnone size-full wp-image-90446" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image019.png" alt="OpenSearch Observability Stack Architecture" width="1202" height="472"&gt;&lt;/p&gt; 
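&lt;p&gt;The routing rule described above can be summarized in a few lines of Python. This is a conceptual sketch only: OpenSearch Ingestion performs the routing through pipeline configuration, not application code, and the &lt;code&gt;Record&lt;/code&gt;, &lt;code&gt;route&lt;/code&gt;, and &lt;code&gt;fan_out&lt;/code&gt; names are hypothetical.&lt;/p&gt;

```python
from dataclasses import dataclass

# Signal-to-store mapping as described in the architecture paragraph.
ROUTES = {
    "traces": "Amazon OpenSearch Service",
    "logs": "Amazon OpenSearch Service",
    "metrics": "Amazon Managed Service for Prometheus",
}


@dataclass(frozen=True)
class Record:
    signal: str    # "traces", "logs", or "metrics"
    payload: dict  # stand-in for the OTLP body


def route(record: Record) -> str:
    """Return the store that should receive this record."""
    if record.signal not in ROUTES:
        raise ValueError(f"unknown signal type: {record.signal}")
    return ROUTES[record.signal]


def fan_out(records):
    """Group payloads by destination store, one list per backend."""
    batches = {}
    for record in records:
        batches.setdefault(route(record), []).append(record.payload)
    return batches
```

&lt;p&gt;Each signal type has exactly one destination, which is why neither store needs a copy of the other’s data.&lt;/p&gt; 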
&lt;p&gt;The entire experience rests on open-source foundations: Prometheus for metrics, OpenSearch for logs and traces, and OpenTelemetry for instrumentation. Teams already running an OpenTelemetry Collector can adopt it by updating the collector’s export configuration to point at Amazon OpenSearch Ingestion, with no proprietary agents or rewritten instrumentation required.&lt;/p&gt; 
&lt;h2&gt;Getting started&lt;/h2&gt; 
&lt;p&gt;To enable these capabilities, log in to the observability workspace in OpenSearch UI, select the &lt;strong&gt;Gear&lt;/strong&gt; icon in the bottom-left corner to open Settings and setup, and verify that the &lt;strong&gt;Observability:apmEnabled&lt;/strong&gt; toggle is on under the Observability section. OpenSearch UI is available at no additional charge for Amazon OpenSearch Service customers.&lt;/p&gt; 
&lt;div style="width: 640px;" class="wp-video"&gt;
 &lt;video class="wp-video-shortcode" id="video-90656-1" width="640" height="360" preload="metadata" controls="controls"&gt;
  &lt;source type="video/mp4" src="https://d2908q01vomqb2.cloudfront.net/artifacts/DBSBlogs/BDB-5856/BDB-5856.mp4?_=1"&gt;
 &lt;/video&gt;
&lt;/div&gt; 
&lt;p&gt;&lt;strong&gt;Explore locally first.&lt;/strong&gt; The &lt;a href="https://opensearch.org/platform/observability-stack/" target="_blank" rel="noopener"&gt;OpenSearch Observability Stack&lt;/a&gt; gives you a fully configured environment including application monitoring, agent tracing, and Prometheus integration, running on your machine with a single install command. It ships with sample instrumented services, including a multi-agent travel planner, so you can explore the full workflow with real telemetry data out of the box.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;For AI agent development.&lt;/strong&gt; &lt;a href="https://observability.opensearch.org/docs/agent-health/" target="_blank" rel="noopener"&gt;Agent Health&lt;/a&gt; is an open-source, evaluation-driven observability tool designed for local development. It gives you execution flow graphs, token tracking, and tool invocation visibility right in your development loop, before you push to production.&lt;/p&gt; 
&lt;p&gt;&lt;strong&gt;For production.&lt;/strong&gt; The &lt;a href="https://observability.opensearch.org/docs/send-data/ai-agents/python/" target="_blank" rel="noopener"&gt;Python SDK&lt;/a&gt; provides one-line setup and decorator-based tracing with &lt;code&gt;gen_ai&lt;/code&gt; semantic conventions, with auto-instrumentation support for OpenAI, Anthropic, Amazon Bedrock, LangChain, LlamaIndex, and others. See the &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/observability.html" target="_blank" rel="noopener"&gt;Amazon OpenSearch Service documentation&lt;/a&gt; and the &lt;a href="https://docs.aws.amazon.com/opensearch-service/latest/developerguide/direct-query-prometheus-overview.html" target="_blank" rel="noopener"&gt;Amazon Managed Service for Prometheus integration guide&lt;/a&gt; for the full managed experience.&lt;/p&gt; 
&lt;hr style="width: 80%"&gt; 
&lt;h2&gt;About the authors&lt;/h2&gt; 
&lt;footer&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90447" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image021.png" alt="" width="100" height="133"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Muthu Pitchaimani&lt;/h3&gt; 
  &lt;p&gt;Muthu is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90450" style="font-size: 16px" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image022.png" alt="" width="100" height="102"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Raaga N.G&lt;/h3&gt; 
  &lt;p&gt;&lt;a href="https://www.linkedin.com/in/raaga-shree/" target="_blank" rel="noopener noreferrer"&gt;Raaga&lt;/a&gt; is a Solutions Architect at AWS with over 5 years of experience helping enterprises modernize their technology landscape and build scalable, cloud-native solutions. She partners with customers to translate business requirements into efficient cloud architectures that drive measurable outcomes, supporting their journey from application modernization to AI adoption through thoughtful, customer-centric solutions.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90448" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image023.png" alt="" width="1920" height="2560"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Rekha Thottan&lt;/h3&gt; 
  &lt;p&gt;Rekha Thottan is a Senior Technical Product Manager at AWS OpenSearch, contributing to AI agent observability and evaluation for the OpenSearch Project.&lt;/p&gt; 
 &lt;/div&gt; 
 &lt;div class="blog-author-box"&gt; 
  &lt;div class="blog-author-image"&gt;
   &lt;img loading="lazy" class="alignnone size-full wp-image-90449" src="https://d2908q01vomqb2.cloudfront.net/b6692ea5df920cad691c20319a6fffd7a4a766b8/2026/04/20/image025.png" alt="" width="576" height="768"&gt;
  &lt;/div&gt; 
  &lt;h3 class="lb-h4"&gt;Kevin Lewin&lt;/h3&gt; 
  &lt;p&gt;Kevin is a Cloud Operations Specialist Solution Architect at Amazon Web Services. He focuses on helping customers achieve their operational goals through observability and automation.&lt;/p&gt; 
 &lt;/div&gt; 
&lt;/footer&gt;</content:encoded>
					
					
			
		
		<enclosure length="30351156" type="video/mp4" url="https://d2908q01vomqb2.cloudfront.net/artifacts/DBSBlogs/BDB-5856/BDB-5856.mp4"/>

			</item>
	</channel>
</rss>