AWS Big Data Blog

Automate deployment of data and AI applications with Amazon SageMaker Unified Studio CI/CD CLI

Saurabh Bhutyani — Thu, 21 May 2026 19:13:50 +0000

Organizations building data and AI applications in Amazon SageMaker Unified Studio combine multiple AWS services, including AWS Glue, Amazon Athena, Amazon Managed Workflows for Apache Airflow (Amazon MWAA), Amazon SageMaker AI, and Amazon Quick Sight, into single applications. Promoting these applications from development to test and production stages requires substituting service-specific configurations for each stage and provisioning resources in the correct order.

Data teams understand which services their applications need but lack continuous integration and continuous delivery (CI/CD) expertise, while DevOps teams understand deployment automation but must learn each AWS service’s provisioning requirements.

The CI/CD CLI for Amazon SageMaker Unified Studio (aws-smus-cicd-cli) is an open source command line tool that automates deployment of multi-service data and AI applications across pipeline stages. Data teams define their application once in a YAML manifest, DevOps teams deploy with a single command, and the CLI handles configuration substitution, dependency ordering, and resource provisioning automatically. For details, see the CI/CD CLI documentation.

In this post, we walk through how the CI/CD CLI works, show you how to deploy a real application across environments, and demonstrate how it fits into your existing CI/CD workflows.

Customer spotlight

Bureau Veritas, a global leader in testing, inspection, and certification, operates across multiple SageMaker Unified Studio environments to support its data and AI teams. With their data and DevOps teams working on different parts of the application lifecycle, Bureau Veritas needed a controlled way to promote workloads from development through test to production while preserving clear ownership boundaries between the two teams.

“We need to promote data and AI applications across SageMaker Unified Studio environments in a controlled way that respects the boundaries between our data teams and our DevOps teams. The CI/CD CLI does exactly that — a single manifest from the data team, a single deploy command from DevOps, and full control over what goes to production.”

— Gilles Kempf, Architecture Manager, Bureau Veritas

How the CI/CD CLI works

The CI/CD CLI introduces a clean separation of concerns between data teams and DevOps teams.

Data teams define what to deploy in a declarative YAML manifest (manifest.yaml). The manifest describes the application’s resources, including AWS Glue extract, transform, and load (ETL) jobs, Athena queries, Airflow directed acyclic graphs (DAGs), Quick Sight dashboards, and SageMaker training jobs, along with stage-specific configurations for each environment.

DevOps teams define how and when to deploy using their existing CI/CD systems. They retain full control over their deployment methodology. They choose whether to promote content through git branches, a bundle artifactory, or both; they decide the shape of the pipeline, including which stages to include (dev, staging, pre-prod, prod) and which manual approvals or security gates are required. They run aws-smus-cicd-cli deploy inside GitHub Actions, Jenkins, or GitLab CI workflows without needing to understand which AWS services the application uses or how SageMaker Unified Studio projects are structured. The CLI is a utility for AWS analytics service deployment, not a CI/CD methodology. Your team’s existing conventions for branches, approvals, and pipeline shape stay exactly as they are.

The CLI is the abstraction layer between the two. It reads the manifest, substitutes stage-specific configurations (S3 paths, AWS Identity and Access Management (IAM) roles, account IDs, and connection strings), provisions resources in dependency order, and handles all AWS service interactions.

The following diagram illustrates this separation:

Key concepts

Application manifest

Each stage maps to a dedicated SageMaker Unified Studio project. This one-stage-to-one-project mapping is the foundation of CI/CD isolation: each project has its own domain, IAM boundaries, connections, and data, so changes in dev can never affect prod. For stronger isolation, projects can span different AWS accounts and AWS Regions, for example, dev in a sandbox account and prod in a production account in a different Region. Because each stage is a real SageMaker Unified Studio project, teams can open it in the console at any time to observe workflows, inspect resources, and troubleshoot deployments. Project membership is managed per project, so you control exactly who has access to each stage, for example, developers in dev and a release team in prod.

The manifest file is the single source of truth for your application. It declares:

Content: application code from git repositories, data files from S3, Quick Sight dashboards, and workflow definitions.
Stages: environment-specific project mappings (dev, test, prod, etc.), each isolated as described earlier.
Configuration: stage-specific settings that are substituted automatically at deploy time.

Here is an example manifest for an analytics application with AWS Glue ETL and Quick Sight:
applicationName: SalesAnalyticsDashboard

content: 
  storage: 
    - name: etl-code 
      include: ["*.py"] 
    - name: workflows 
      include: ["*.yaml"] 
  quicksight: 
    - name: SalesDashboard 
      type: dashboard 
  workflows: 
    - workflowName: sales_etl_pipeline 
      connectionName: default.workflow_serverless 
 
stages: 
  dev: 
    domain: 
      region: us-east-1 
    project: 
      name: analytics-dev 
    deployment_configuration: 
      storage: 
        - name: etl-code 
          connectionName: default.s3_shared 
          targetDirectory: sales/bundle/etl 
        - name: workflows 
          connectionName: default.s3_shared 
          targetDirectory: sales/bundle/workflows 
 
  prod: 
    domain: 
      region: us-west-2 
    project: 
      name: analytics-prod 
    deployment_configuration: 
      storage: 
        - name: etl-code 
          connectionName: default.s3_shared 
          targetDirectory: sales/bundle/etl 
        - name: workflows 
          connectionName: default.s3_shared 
          targetDirectory: sales/bundle/workflows 
      quicksight: 
        assets: 
          - name: SalesDashboard 
            owners: 
              - arn:aws:quicksight:${AWS_REGION}:${AWS_ACCOUNT_ID}:user/default/Admin/*

Each stage must map to a separate SageMaker Unified Studio project, providing full isolation between environments. The CLI substitutes variables like ${AWS_ACCOUNT_ID} and ${AWS_REGION} at deploy time based on the target environment.
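To make the substitution step concrete, the following is a minimal Python sketch of how a `${NAME}` placeholder in a manifest value could be resolved per stage. It is an illustration only and not the CLI's actual implementation; the function name `substitute_stage_variables` and the account ID `111122223333` are assumptions made for the example.

```python
from string import Template


def substitute_stage_variables(text: str, variables: dict) -> str:
    """Replace ${NAME} placeholders with stage-specific values.

    Raises KeyError if the text references a variable that is not defined,
    so a missing value fails loudly instead of reaching a target stage.
    """
    return Template(text).substitute(variables)


# Owner ARN taken from the prod stage of the example manifest.
owner_arn = "arn:aws:quicksight:${AWS_REGION}:${AWS_ACCOUNT_ID}:user/default/Admin/*"

# Hypothetical values for the prod stage (placeholder account ID).
prod_values = {"AWS_REGION": "us-west-2", "AWS_ACCOUNT_ID": "111122223333"}

resolved = substitute_stage_variables(owner_arn, prod_values)
print(resolved)
# arn:aws:quicksight:us-west-2:111122223333:user/default/Admin/*
```

Failing on an undefined variable is a deliberate choice in this sketch: a silently unresolved placeholder is harder to diagnose after deployment than an error at substitution time.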

Bundles

A bundle is an immutable, versioned archive of your application. The bundle command reads from a source stage (typically dev) and packages the application code, workflow definitions, and resolved configurations into a self-contained artifact. The deploy command then applies that artifact to one or more target stages (test or prod).

This stage-to-bundle-to-stage promotion model supports controlled rollout through quality gates:

# Package from dev 
aws-smus-cicd-cli bundle --manifest manifest.yaml 
 
# Deploy to test 
aws-smus-cicd-cli deploy --manifest app.tar.gz --targets test 
 
# Validate the test deployment 
aws-smus-cicd-cli test --manifest manifest.yaml --targets test 
 
# Promote the same bundle to prod 
aws-smus-cicd-cli deploy --manifest app.tar.gz --targets prod

The same artifact is deployed at every stage without rebuilding, providing audit trails and reproducible deployments for regulated industries.
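If you want your pipeline to confirm that the archive promoted to prod is byte-for-byte the one validated in test, one option is to record a checksum when the bundle is created and compare it before each later deploy. The following Python sketch shows that idea. It is an optional pipeline-side check and not a documented CLI feature; the helper names are assumptions.

```python
import hashlib


def sha256_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_same_artifact(path: str, expected_digest: str) -> bool:
    """Compare a file's digest with the digest recorded at bundle time."""
    return sha256_of_file(path) == expected_digest
```

For example, a pipeline could store the output of `sha256_of_file("app.tar.gz")` alongside the bundle after the bundle step, then call `is_same_artifact("app.tar.gz", recorded_digest)` before the prod deploy and stop if it returns `False`.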

SageMaker Catalog integration

The CLI manages Amazon SageMaker Catalog resources as part of the deployment process. You can define catalog assets, glossaries, glossary terms, form types, asset types, and metadata forms in your manifest. During deployment, the CLI searches for assets in the catalog, creates subscription requests for required data access, and waits for approval before proceeding. This automates the data governance workflow that teams previously handled manually.

CLI commands

The CI/CD CLI provides commands that cover the full deployment lifecycle:

Command	Description
describe	Validates the manifest, checks that target projects exist, and confirms the execution role has required permissions. Use --connect to validate against live AWS environments.
bundle	Reads from a source stage and packages application code, workflow definitions, and configurations into an immutable, versioned archive.
deploy	Applies bundle contents to one or more target stages. Provisions resources in dependency order.
test	Runs post-deployment validation to confirm services are running and ready for workloads.
create	Generates a starter manifest from an existing SageMaker Unified Studio project.
run	Triggers Airflow workflow execution on MWAA or Airflow Serverless connections.
monitor	Monitors workflow execution status in real time.
logs	Fetches and streams workflow execution logs.
destroy	Removes deployed resources and projects for cleanup or failure recovery.

Walkthrough: deploying a Quick Sight dashboard with AWS Glue ETL

In this section, we walk through deploying an analytics application that uses AWS Glue for ETL, Athena for queries, and Quick Sight for dashboards. This example is available in the GitHub repository.

Use case

An analytics team owns a Sales Analytics Dashboard built on AWS Glue ETL, Athena, and Quick Sight. They want to promote changes from a development environment to production with reproducible builds, automated validation, and a clear approval gate between stages, without writing custom deployment scripts or exposing data engineers to AWS provisioning details.

Solution overview

We use a sample application from the CI/CD CLI GitHub repository that includes AWS Glue ETL scripts, an Airflow workflow definition, a Quick Sight dashboard bundle, and integration tests. A single manifest.yaml describes the application and its dev and prod stages. The CLI handles the full lifecycle: bundle the app from dev, deploy it to test, run validation, and promote the same immutable artifact to prod.

Prerequisites

Before you begin, make sure you have the following:

Python 3.8 or later.
AWS credentials with permissions to deploy to your SageMaker Unified Studio projects. For details on configuring credentials, see Configuration and credential file settings in the AWS CLI.
Existing SageMaker Unified Studio projects for your target stages.

Solution architecture

Each stage in the manifest maps to a dedicated SageMaker Unified Studio project (see the separation-of-concerns diagram in “How the CI/CD CLI works” earlier in this post). At deploy time, the CLI uploads ETL scripts and workflow definitions to the project’s S3 storage connection, provisions the Airflow workflow in MWAA Serverless, runs the workflow to create AWS Glue jobs and databases, and imports the Quick Sight dashboard. The same bundle artifact is applied to every downstream stage, ensuring dev, test, and prod stay in sync while remaining fully isolated.

Solution implementation

Step 1: Install the CLI

Install the CLI from PyPI:

pip install aws-smus-cicd-cli

Step 2: Create or customize a manifest

Clone the repository and start from the analytics example:

git clone https://github.com/aws/CICD-for-SageMakerUnifiedStudio.git
cd CICD-for-SageMakerUnifiedStudio/examples/analytic-workflow/dashboard-glue-quick

The example includes AWS Glue ETL scripts, an Airflow workflow definition, a Quick Sight dashboard bundle, and integration tests. Open manifest.yaml and update the project, domain, and deployment_configuration values under each stage so they match your own SageMaker Unified Studio projects and connection names.

Alternatively, generate a manifest from an existing project:

aws-smus-cicd-cli create --domain-id <your-domain-id> --dev-project-id <your-project-id>

Step 3: Validate your configuration

Run the describe command with --connect to verify your environment is ready. This connects to your AWS environment and validates that target projects exist, the execution role has the required permissions, and connections are reachable. Fix any issues before deploying.

aws-smus-cicd-cli describe --manifest manifest.yaml --connect

Step 4: Deploy

Run the deployment:

aws-smus-cicd-cli deploy --targets test --manifest manifest.yaml

During deployment, the CLI:

Uploads ETL scripts and workflow definitions to S3 using the project’s storage connection.
Creates the Airflow workflow in MWAA Serverless.
Runs the workflow, which provisions AWS Glue jobs, creates databases, and runs ETL transformations.
Imports the Quick Sight dashboard and refreshes datasets with the latest data.
Processes any catalog asset subscriptions defined in the manifest.
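The idea of provisioning in dependency order can be illustrated with a topological sort. The following Python sketch mirrors the steps listed above; the step names and the dependency edges between them are assumptions made for illustration, not the CLI's internal model. It uses `graphlib`, which is in the standard library from Python 3.9.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical step names. Each key maps to the set of steps it depends on.
# The edges simply follow the order of the list above and are illustrative.
dependencies = {
    "upload_code_to_s3": set(),
    "create_airflow_workflow": {"upload_code_to_s3"},
    "run_workflow": {"create_airflow_workflow"},
    "import_quick_sight_dashboard": {"run_workflow"},
    "process_catalog_subscriptions": {"import_quick_sight_dashboard"},
}

# static_order() yields every step after all of the steps it depends on.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Declaring dependencies rather than a fixed sequence means that adding a new resource only requires stating what it depends on; the ordering is then derived, and a circular dependency raises `graphlib.CycleError`.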

Step 5: Validate

Run post-deployment validation to confirm services are running and ready for workloads:

aws-smus-cicd-cli test --manifest manifest.yaml --targets test

Step 6: Promote to production

Promote the same bundle artifact that was validated in the test stage to production. This guarantees the exact same artifact runs in prod:

# Promote the same bundle that was validated in test to prod
aws-smus-cicd-cli deploy --manifest app.tar.gz --targets prod

Integrating with GitHub Actions

The CLI works with existing CI/CD solutions. The GitHub repository includes reusable workflow templates that DevOps teams can adopt directly.

The following is an example of a GitHub Actions workflow that implements a full bundle-based deployment pipeline:

name: Deploy Analytics Application 
on: 
  push: 
    branches: [main] 
 
jobs: 
  deploy-test: 
    runs-on: ubuntu-latest 
    steps: 
      - uses: actions/checkout@v4 
 
      - name: Install CLI 
        run: pip install aws-smus-cicd-cli 
 
      - name: Configure AWS credentials 
        uses: aws-actions/configure-aws-credentials@v4 
        with: 
          role-to-assume: ${{ secrets.AWS_ROLE_ARN }} 
          aws-region: us-east-1 
 
      - name: Validate 
        run: aws-smus-cicd-cli describe --manifest manifest.yaml --connect 
 
      - name: Bundle 
        run: aws-smus-cicd-cli bundle --manifest manifest.yaml 
 
      - name: Deploy to test 
        run: aws-smus-cicd-cli deploy --targets test --manifest manifest.yaml 
 
      - name: Run tests 
        run: aws-smus-cicd-cli test --manifest manifest.yaml --targets test 
 
  deploy-prod: 
    needs: deploy-test 
    runs-on: ubuntu-latest 
    environment: production 
    steps: 
      - uses: actions/checkout@v4 
 
      - name: Install CLI 
        run: pip install aws-smus-cicd-cli 
 
      - name: Configure AWS credentials 
        uses: aws-actions/configure-aws-credentials@v4 
        with: 
          role-to-assume: ${{ secrets.AWS_PROD_ROLE_ARN }} 
          aws-region: us-west-2 
 
      - name: Deploy to production 
        run: aws-smus-cicd-cli deploy --targets prod --manifest manifest.yaml

The CLI also works with Jenkins, GitLab CI, and Azure DevOps. See the CI/CD integration guide for additional examples.

In the next section, we cover which AWS services and workload types the CLI supports.

Supported workloads

The CLI deploys applications that span the following AWS services through Airflow workflow definitions:

Analytics and BI: AWS Glue ETL jobs and crawlers, Amazon Athena queries, Amazon Quick Sight dashboards, Amazon EMR jobs, Amazon Redshift queries.
Machine learning: SageMaker training jobs, ML model endpoints, SageMaker AI Pipelines.
Code and workflows: Jupyter notebooks, Python scripts, Airflow DAGs (MWAA and MWAA Serverless).
Data and storage: S3 data files, Git repositories, SageMaker Catalog resources (glossaries, glossary terms, form types, asset types, assets, data products, metadata forms).

The examples directory includes working applications for each of these patterns, with manifests, workflow definitions, and integration tests.

Failure recovery

If a deployment fails, the CLI stops at the point of failure and reports the error with a detailed stack trace. To recover:

Run aws-smus-cicd-cli describe --connect to check which resources exist and which permissions are missing.
Fix the issue and rerun aws-smus-cicd-cli deploy.
For bundle-based deployments, redeploy a previous bundle version.
Use aws-smus-cicd-cli destroy --targets <target> --force to clean up a failed deployment.

For detailed rollback procedures, see the Rollback Guide.

Conclusion

In this post, you learned how the Amazon SageMaker Unified Studio CI/CD CLI gives data and DevOps teams a clean separation of concerns: data teams describe their application once in a YAML manifest, and DevOps teams deploy it with a single command through their existing CI/CD pipelines. You saw how stages map to isolated SageMaker Unified Studio projects (optionally spanning AWS accounts and Regions), how bundles provide immutable, reproducible promotion through test and production, and how the CLI integrates with GitHub Actions, Jenkins, GitLab CI, and Azure DevOps. You also walked through deploying a Glue-and-Quick-Sight analytics application from dev through to prod.

Get started

The CI/CD CLI is available at no additional cost in all AWS Regions where Amazon SageMaker Unified Studio is available. You pay only for the underlying AWS resources provisioned during deployment.

Use the following steps to try it out:

Install the CLI:
pip install aws-smus-cicd-cli
Browse the example applications for analytics and ML patterns.
Follow the CI/CD CLI documentation to deploy your first application in 10 minutes.
Review the Admin Guide for infrastructure setup.

For feedback and bug reports, open an issue on the GitHub repository.


A systematic approach to benchmarking SQL processing engines on AWS

Anubhav Awasthi — Tue, 19 May 2026 15:44:02 +0000

Selecting the right SQL processing solution for large-scale data analytics is a critical decision for organizations. As data volumes grow exponentially, the technology landscape has evolved to offer diverse options for processing and analyzing this information efficiently. This post presents a systematic framework for evaluating and benchmarking SQL processing engines on AWS, using Apache JMeter to conduct practical performance testing at scale.

The AWS analytics ecosystem

AWS offers a rich portfolio of SQL processing solutions to meet various analytical needs:

Serverless query services – Amazon Athena is a serverless, interactive query service that uses standard SQL to analyze data in Amazon Simple Storage Service (Amazon S3), offering automatic scaling, parallel query execution, and pay-per-query pricing with no infrastructure management required
Data warehouse solutions – Amazon Redshift offers scalable, high-performance cloud data warehousing with serverless options, zero-ETL integrations, AI-powered query assistance, and seamless machine learning (ML) integration for modern analytics at scale
Managed open source engines – Amazon EMR supports Apache Spark SQL, Apache Trino (formerly PrestoSQL), and other distributed query frameworks
Self-managed options – You can deploy open source engines like Apache Spark, Apache Flink, and Trino on Amazon Elastic Kubernetes Service (Amazon EKS) for greater control
Partner solutions – You can access specialized big data analytics tools through AWS Marketplace

These options are further enhanced by modern open table formats such as Apache Iceberg, Delta Lake, and Apache Hudi, which bring crucial enterprise features like ACID (Atomicity, Consistency, Isolation, and Durability) transactions, schema evolution, and time travel capabilities to data lakes. These SQL processing solutions operate under the AWS Shared Responsibility Model. AWS manages the security of the underlying infrastructure and services, and customers are responsible for secure configuration, access management, and data protection within their testing environments. This division of responsibility remains important when evaluating and benchmarking different SQL engines. Proper security configuration and implementation by customers is essential for maintaining a secure analytics environment.

Evaluation challenges in SQL engine selection

The rich ecosystem of SQL processing options creates significant evaluation challenges. Each SQL engine employs unique architectural approaches and optimization strategies, making direct comparisons complex. Organizations embarking on this evaluation journey face several interconnected obstacles:

Creating environments that accurately reflect production scenarios
Developing test datasets that mirror real-world data characteristics and volumes
Replicating real-world query patterns and concurrency levels
Maintaining uniform testing conditions across different engine architectures
Controlling infrastructure expenses throughout the evaluation process

Performance considerations at petabyte scale

When evaluating solutions for petabyte-scale deployments, the complexity intensifies considerably. Several critical factors come into play:

Resource management – Distributed SQL engines require precise balancing of CPU, memory, and storage resources. Suboptimal resource allocation can lead to query failures and performance degradation, particularly as data volumes grow.
Data distribution patterns – How data is distributed across partitions or nodes significantly impacts query performance. Data skew can create processing bottlenecks, with some nodes handling disproportionate workloads while others remain underutilized.
Concurrency handling – High-concurrency environments demand sophisticated workload scheduling and resource isolation mechanisms. The ability to maintain consistent performance under varying concurrent loads becomes a critical differentiator between solutions.
Meaningful metrics – Performance evaluation at scale requires comprehensive metrics analysis:
- Mean, median, and percentile response times (particularly p90 and p95)
- Query throughput under varying concurrency levels
- Scalability characteristics across diverse workload types
- Resource utilization efficiency during peak loads
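As a small illustration of the response-time metrics listed above, the following Python sketch computes the mean, median, p90, and p95 from a list of per-query response times. It uses the nearest-rank percentile definition, which is only one of several conventions, so values can differ slightly from what a reporting tool shows. The sample timings are invented for the example.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(response_times_s):
    """Summarize response times (in seconds) for one test batch."""
    return {
        "mean": statistics.mean(response_times_s),
        "median": statistics.median(response_times_s),
        "p90": percentile(response_times_s, 90),
        "p95": percentile(response_times_s, 95),
    }


# Invented sample: mostly fast queries with two slow outliers.
sample = [1.2, 0.9, 1.1, 4.8, 1.0, 1.3, 0.8, 1.1, 1.0, 9.5]
print(summarize(sample))
```

In this sample the two slow queries pull the mean well above the median, while p90 and p95 expose the tail directly, which is why percentiles are listed alongside averages.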

Limitations of traditional benchmarks

Although industry-standard benchmarks like TPC-DS and TPC-H provide valuable insights, our experience with multiple customer engagements has shown that tailored, workload-specific testing often reveals performance characteristics not captured by these standardized tests. This is especially true for complex, multi-tenant environments with diverse query patterns. Organizations that complement standard benchmarks with workload-specific testing typically experience shorter proof-of-concept cycles, optimized evaluation costs, and more efficient testing operations. This comprehensive approach helps reduce uncertainty in the final solution selection process.

Prerequisites

Before you dive into the evaluation process, make sure you have the following prerequisites:

An AWS account with appropriate permissions to create and manage Amazon Elastic Compute Cloud (Amazon EC2) instances and access the SQL engines you plan to benchmark.
Basic familiarity with AWS services, particularly Amazon EC2 and the SQL engines you intend to evaluate (such as Athena, Amazon Redshift, or Amazon EMR).
Experience with SQL and data analytics concepts.
Access to the SQL engines you choose to benchmark. This post assumes you’ve already set up the engines you want to test. For setup instructions, refer to the AWS documentation for each service.
A dataset suitable for your benchmarking needs. Dataset creation and loading are not covered in this post. Build petabyte-scale synthetic test data with Amazon EMR on EC2 provides prescriptive guidance to generate test datasets at scale. Make sure your test datasets are stored in S3 buckets with encryption enabled (using SSE-KMS or SSE-S3) and that all service connections use TLS for data in transit.

Benefits of Apache JMeter

As organizations scale their analytics workloads to petabyte levels, there is a growing need for a robust, structured approach to SQL query performance testing. Although many organizations develop custom testing frameworks or use various benchmarking tools, these approaches often lack standardization and can be difficult to replicate across different SQL engines. The complexity of modern data architectures, combined with the variety of available SQL processing solutions, demands a systematic evaluation methodology. Apache JMeter emerges as a powerful solution to address this challenge. Though traditionally known for web application testing, JMeter’s extensible architecture and robust feature set make it particularly well-suited for SQL performance testing at scale.

JMeter offers several advantages for evaluating SQL engines:

Support for multiple protocols and connections
Ability to simulate complex concurrent workloads
Built-in performance metrics and reporting
Extensible architecture for custom testing scenarios
Integration capabilities with continuous integration and continuous delivery (CI/CD) pipelines

Through this proposed framework, which has been validated across multiple customer engagements at petabyte scale, we aim to help organizations make more informed decisions when selecting a SQL processing solution. Our experience working with customers to assess various AWS Analytics services and open source solutions has demonstrated that a systematic evaluation approach significantly reduces proof-of-concept cycles and optimizes resource investments. This framework has helped organizations effectively evaluate services like Athena, Amazon Redshift, and Amazon EMR, alongside open source solutions such as Trino on Amazon EKS, based on their specific workload profiles and performance requirements.

With this methodology, organizations can accomplish the following:

Navigate the complex landscape of large-scale data processing technologies
Reduce proof-of-concept cycles from months to weeks
Minimize infrastructure costs during evaluation phases
Make data-driven decisions about technology selection
Better align technology choices with business requirements
Establish repeatable testing patterns for future evaluations

Testing methodology in practice

A successful SQL engine evaluation requires understanding and replicating real-world workload patterns. Our methodology, refined through numerous customer engagements, focuses on comprehensive testing across multiple dimensions while remaining adaptable to specific organizational needs.

Query pattern selection

We begin by selecting representative query patterns that mirror production workloads:

Aggregation queries that summarize large datasets using operations like SUM, AVG, and COUNT
Complex join operations that test the engine’s ability to combine data efficiently across multiple tables
String operations that evaluate text processing capabilities
Nested queries that assess the engine’s optimization capabilities for complex query structures

A carefully selected set of 8–10 queries typically provides sufficient coverage while keeping the evaluation manageable. These should reflect your actual workload characteristics and business requirements.

Data volume variations

Testing across different data volumes is important for understanding scalability characteristics. We structure our tests around varying data scan ranges:

Small-scale scans – Queries accessing 1–7 days of data (megabytes to gigabytes)
Large-scale scans – Queries spanning 14–30 days (terabytes to petabytes)

This approach evaluates both I/O efficiency with large datasets and metadata handling with smaller, frequent queries, helping understand how services like Amazon EMR, Amazon Redshift, or Athena optimize query execution across different access patterns.
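One way to keep the scan ranges consistent across engines is to generate the date predicate programmatically and inject it into each query. The following Python sketch builds a filter for 1-, 7-, 14-, and 30-day windows. The column name `event_date` and the helper name are assumptions, and the `DATE '...'` literal syntax may need adjusting for your engine's SQL dialect.

```python
from datetime import date, timedelta

# Scan windows in days, covering both the small-scale and large-scale ranges.
SCAN_WINDOWS_DAYS = (1, 7, 14, 30)


def date_filter(end_date: date, days: int, column: str = "event_date") -> str:
    """Build an inclusive date-range predicate covering `days` days
    that ends on `end_date`."""
    if days < 1:
        raise ValueError("days must be at least 1")
    start = end_date - timedelta(days=days - 1)
    return (
        f"{column} BETWEEN DATE '{start.isoformat()}' "
        f"AND DATE '{end_date.isoformat()}'"
    )


for window in SCAN_WINDOWS_DAYS:
    print(window, "->", date_filter(date(2026, 5, 1), window))
```

Generating the predicate from a single end date keeps every engine scanning the same partitions for a given window, which supports uniform testing conditions.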

Concurrency testing

Real-world analytics environments rarely process single queries in isolation. Our methodology incorporates the following features:

Progressive concurrency testing that starts at lower levels and scales up (typically 16, 32, 64, and 128 parallel queries). You can adjust these numbers based on your test infrastructure capacity and specific requirements. We recommend starting with smaller concurrency levels and gradually scaling up to understand performance characteristics.
Varied query complexity and frequency (referred to as query weights) to simulate realistic workload distributions. This means some queries are run more often or are more resource-intensive than others, mimicking real-world usage patterns.
Mixed query patterns running simultaneously to test resource management.
Consistent execution across different date ranges to evaluate scaling behavior.

This approach is particularly important when evaluating managed services like the workload management capabilities of Amazon Redshift or the resource allocation strategies of Amazon EMR.

Query weight distribution

Production environments typically see varying frequencies of different query types. Our framework incorporates weighted query distribution to simulate real-world scenarios more accurately. In a typical distribution, frequent lightweight queries might represent 60% of the workload, complex analytical queries might comprise 30%, and resource-intensive data processing operations might make up the remaining 10%.

This weighted approach makes sure performance testing reflects actual usage patterns rather than artificial benchmarking scenarios. The exact distribution should mirror your organization’s specific workload patterns.
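To show what a weighted distribution means in practice, the following Python sketch draws a workload of query submissions using the 60/30/10 split described above. It is a standalone illustration of weighted sampling, not how JMeter is configured; the category names and the helper name are assumptions.

```python
import random
from collections import Counter

# Weights from the typical distribution described above (percent of workload).
QUERY_WEIGHTS = {"lightweight": 60, "analytical": 30, "heavy_processing": 10}


def sample_workload(n, weights, seed=None):
    """Draw n query categories at random, proportionally to their weights.

    Passing a seed makes the draw reproducible between test runs.
    """
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[name] for name in names], k=n)


workload = sample_workload(1000, QUERY_WEIGHTS, seed=42)
print(Counter(workload))
```

Because the draw is probabilistic, the observed counts only approximate the configured weights, and the approximation improves as the number of submissions grows.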

Sequential vs. concurrent testing

Our methodology implements two distinct testing phases:

Sequential testing – Establishes baseline performance metrics:
- Runs each query type independently across different date ranges
- Runs multiple iterations to provide consistency and identify variability
- Helps understand individual query performance characteristics
Concurrent testing – Simulates real-world multi-user scenarios:
- Implements weighted query distributions
- Tests different concurrency levels to identify scaling limitations
- Evaluates resource management capabilities of different engines

JMeter efficiently implements both testing phases while maintaining consistent test conditions across SQL engines. Its ability to handle various JDBC connections makes it particularly suitable for testing AWS analytics services.

Through this structured approach, organizations can gather comprehensive performance data reflecting their specific use cases, enabling informed SQL engine selection decisions while maintaining core principles of systematic evaluation and realistic workload simulation.

Test plans

To evaluate SQL engines’ performance under varying workloads, we designed two test scenarios: sequential and concurrent execution plans. Each scenario was executed across different data volumes by adjusting the query date range filters to cover 1, 7, 14, and 30 days. These variations simulate typical analytical workloads with progressively increasing data sizes.

For sequential runs, each test was treated as a distinct batch, grouping all queries (Query 1 to Query 9) under the same date range. Each query scans data for 1, 7, 14, or 30 days, with the corresponding date filter in the query’s WHERE predicate. We used JMeter to capture average query response times for each batch. This configuration was run three times, and the final metrics reflect the average response time across these iterations to improve reliability and account for environmental variance.

Although three iterations provide initial insights, if you observe significant variations in results (typically more than 10% deviation between runs), consider expanding to 10 or more iterations. This additional sampling helps establish statistical significance, identify true performance patterns, and distinguish outliers (beyond three standard deviations) from normal variations. Document any consistent anomalies, because they may indicate important performance or security considerations for your specific environment.

The following table shows a sample template for the sequential test plan.

Dataset Time Range	Run	Query Weights
Dataset Time Range	Run	Query 1	Query 2	Query 3	Query 4	Query 5	Query 6	Query 7	Query 8	Query 9
1 day	Run 1
	Run 2
	Run 3
	Avg
7 days	Run 1
	Run 2
	Run 3
	Avg
14 days	Run 1
	Run 2
	Run 3
	Avg
30 days	Run 1
	Run 2
	Run 3
	Avg
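
The averaging and the 10% deviation check described for sequential runs can be sketched in a few lines of Python. This is a minimal illustration, not part of the JMeter setup: the function name `summarize_runs`, the choice to measure deviation relative to the mean of the runs, and the timings are all assumptions made for the example.

```python
from statistics import mean

def summarize_runs(run_times_ms, max_deviation=0.10):
    """Average response times across runs and flag batches that vary too much.

    run_times_ms: average response time (ms) reported for each run of one batch.
    max_deviation: relative deviation from the mean that suggests more iterations.
    """
    avg = mean(run_times_ms)
    # Largest relative distance of any single run from the mean of all runs
    worst = max(abs(t - avg) / avg for t in run_times_ms)
    return {
        "avg_ms": avg,
        "max_deviation": worst,
        "needs_more_runs": worst > max_deviation,
    }

# Made-up timings for illustration only (not measured results)
stable = summarize_runs([1000, 1020, 980])
noisy = summarize_runs([1000, 1400, 900])
print(stable)
print(noisy)
```

A batch flagged with `needs_more_runs` is a candidate for the expanded 10 or more iterations mentioned earlier.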

For the concurrent test plan, we introduced a probabilistic weighted distribution to the queries (Query 1 to Query 9), simulating a more realistic production-like environment where query frequency varies based on business relevance and usage patterns. This added a layer of complexity to better reflect how the SQL engine would perform under real-world concurrent access patterns.

The following table shows the sample test plan template for the concurrent test plan run.

Dataset Time Range	Concurrent Runs	Query Weights
Dataset Time Range	Concurrent Runs	Query 1	Query 2	Query 3	Query 4	Query 5	Query 6	Query 7	Query 8	Query 9
1 day	8	11%	11%	11%	11%	11%	11%	11%	11%	11%
	16	10%	5%	24%	5%	5%	5%	24%	14%	10%
	32	8%	3%	24%	5%	5%	5%	24%	16%	8%
	64	7%	3%	24%	6%	4%	6%	26%	16%	9%
	128	1%	4%	19%	8%	5%	7%	14%	20%	22%
*7 days	8	11%	11%	11%	11%	11%	11%	11%	11%	11%
	16	10%	5%	24%	5%	5%	5%	24%	14%	10%
	32	8%	3%	24%	5%	5%	5%	24%	16%	8%
	64	7%	3%	24%	6%	4%	6%	26%	16%	9%
	**128	1%	4%	19%	8%	5%	7%	14%	20%	22%
14 days	8	11%	11%	11%	11%	11%	11%	11%	11%	11%
	16	10%	5%	24%	5%	5%	5%	24%	14%	10%
	32	8%	3%	24%	5%	5%	5%	24%	16%	8%
	64	7%	3%	24%	6%	4%	6%	26%	16%	9%
	128	1%	4%	19%	8%	5%	7%	14%	20%	22%
30 days	8	11%	11%	11%	11%	11%	11%	11%	11%	11%
	16	10%	5%	24%	5%	5%	5%	24%	14%	10%
	32	8%	3%	24%	5%	5%	5%	24%	16%	8%
	64	7%	3%	24%	6%	4%	6%	26%	16%	9%
	128	1%	4%	19%	8%	5%	7%	14%	20%	22%

For example, for the *7 days concurrent run at **128 concurrency, the proposed configuration weights Query 1 through Query 9 so that Query 9 is executed most often across the 128 executions submitted for this run.

JMeter setup

To begin, you must set up JMeter on a machine that can handle the desired test load. An EC2 instance is a flexible and cost-effective option. Choose an instance type with sufficient vCPUs to support your maximum planned concurrency. For example, a c6i.4xlarge or higher is typically suitable for moderate to high throughput testing scenarios. For the operating system, you might choose Amazon Linux, which is optimized for AWS. For production-grade testing environments, deploy the JMeter EC2 instance in a private subnet of a virtual private cloud (VPC) with appropriate security groups that allow only required connections. This network isolation helps maintain security while executing performance tests. Consider using Amazon Virtual Private Cloud (Amazon VPC) endpoints for secure access to AWS services.

After the instance is provisioned, install Java (Java 17 LTS or Java 21 LTS) and download the latest version of JMeter. Be sure to configure the system with appropriate JVM options to allocate sufficient heap memory for large-scale test executions. Refer to Getting Started to learn more.

# Install Java
sudo yum update -y # For Amazon Linux
sudo yum install java-17-amazon-corretto -y

# Download JMeter and place the appropriate JDBC driver for the engine of your selection under the lib folder
wget https://downloads.apache.org/jmeter/binaries/apache-jmeter-5.6.3.tgz
tar -xvzf apache-jmeter-5.6.3.tgz
cd apache-jmeter-5.6.3
# Copy the JDBC driver JAR into ./lib before starting JMeter

# Launch JMeter in GUI mode (if using a GUI-capable setup) or use CLI for remote testing
./bin/jmeter

JMeter concepts

Before you create test plans in JMeter, it’s important to understand a few foundational concepts that influence how your test plan behaves, such as thread groups, user-defined variables, and JDBC connection configuration. These components enable the simulation of real-world query loads, including concurrency and pacing.

Test plans

The test plan is the top-level container for a JMeter test. It defines the overall testing strategy, including the queries to execute, their parameters, and the concurrent user behavior. These plans are saved as .jmx files that can then be used for CLI-based execution. JMeter supports both GUI and CLI modes. We recommend using the JMeter GUI primarily to create test plans as .jmx files, and the CLI to run large load tests. By default, JMeter runs all thread groups in parallel, which suits concurrent execution. You can also run thread groups consecutively for sequential execution. Refer to Building a Test Plan to learn more about options available with test plans.

User-defined variables

User-defined variables are global parameters that you can reuse throughout the test plan. They are helpful for defining database credentials, server URLs, or query parameters. For example:

DB_URL=jdbc:trino://trino-cluster.example.com:8889?SSL=true # Enable SSL/TLS

You can configure authentication (user name and password) through your organization’s approved methods, such as AWS Secrets Manager (see Move hardcoded secrets to AWS Secrets Manager), AWS Identity and Access Management (IAM) roles, or other secure credential management systems.

Thread groups

A thread group represents a group of virtual users (threads) executing test actions. Each thread simulates a single user sending requests to the SQL engine, so thread groups can be used to simulate concurrent runs. For example, in the preceding template, Query 3 has a 19% weight at a concurrency of 128. This means 0.19 × 128 ≈ 24.3, which rounds up to 25 runs, so we set the thread group to 25 threads.
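
To size every thread group the same way, the weight-to-thread calculation can be scripted. The following Python sketch uses the weights from the 128-concurrency row of the concurrent test plan template. The helper name `threads_per_query` is hypothetical, and rounding up is an assumption inferred from the Query 3 example (24.32 becomes 25).

```python
def threads_per_query(weights_pct, concurrency):
    """Convert percentage weights into thread counts, rounding up."""
    # Integer ceiling division avoids floating-point rounding surprises
    return {q: -(-w * concurrency // 100) for q, w in weights_pct.items()}

# Weights from the 128-concurrency row of the concurrent test plan template
weights_128 = {
    "Query 1": 1, "Query 2": 4, "Query 3": 19,
    "Query 4": 8, "Query 5": 5, "Query 6": 7,
    "Query 7": 14, "Query 8": 20, "Query 9": 22,
}

counts = threads_per_query(weights_128, 128)
for query, threads in counts.items():
    print(query, threads)
print("total threads:", sum(counts.values()))
```

Because every count is rounded up, the total can exceed the nominal concurrency (133 threads here instead of 128). If the total must match exactly, round to the nearest integer instead and adjust the largest group.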

JDBC connection configuration

JDBC connection configuration sets up the database connection for the test. It specifies the database URL, driver, and credentials required for executing SQL queries. Key fields to configure are database URL and JDBC driver class. The following table summarizes the different configuration settings.

SQL Engine	JDBC Driver	JDBC Driver Class
Trino on EMR	`trino-jdbc-<trino_version>-amzn-0.jar`	`io.trino.jdbc.TrinoDriver`
Athena	Athena JDBC 3.x driver	`com.amazon.athena.jdbc.AthenaDriver`
Amazon Redshift	Amazon Redshift JDBC driver	`com.amazon.redshift.jdbc.Driver`
Trino on EKS	Trino JDBC driver	`io.trino.jdbc.TrinoDriver`

JDBC requests

The JDBC request executes SQL queries against the database using the configuration defined in the JDBC connection configuration.

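For example, the following command runs JMeter in CLI mode:

# Run benchmarks in CLI mode (-n: non-GUI, -t: test plan, -l: results file, -e -o: generate the HTML report in the output folder)
./bin/jmeter -n -t <path_to>.jmx -l <local path for log>.log -e -o <local path for>/output/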

The output folder will contain an HTML report with different statistics. The following screenshot illustrates 128 concurrent runs.
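
Besides the HTML report, the results file written with -l can be summarized directly, which helps when comparing engines side by side. The following Python sketch assumes the results file is in CSV format with a header row that includes the label, elapsed, and success columns. Check your JMeter save-service settings, because the format is configurable. The sample rows are made up for illustration and are not measured results.

```python
import csv
import io
from collections import defaultdict

# Made-up sample rows in a JMeter-style CSV layout (illustration only)
SAMPLE_RESULTS = """timeStamp,elapsed,label,responseCode,success
1700000000000,1200,Query 1,200,true
1700000001000,1000,Query 1,200,true
1700000002000,3000,Query 3,200,true
1700000003000,5000,Query 3,500,false
"""

def summarize(results_file):
    """Return per-label sample count, average elapsed ms, and error rate."""
    elapsed = defaultdict(list)
    errors = defaultdict(int)
    for row in csv.DictReader(results_file):
        elapsed[row["label"]].append(int(row["elapsed"]))
        if row["success"].lower() != "true":
            errors[row["label"]] += 1
    return {
        label: {
            "samples": len(times),
            "avg_ms": sum(times) / len(times),
            "error_rate": errors[label] / len(times),
        }
        for label, times in elapsed.items()
    }

summary = summarize(io.StringIO(SAMPLE_RESULTS))
for label, stats in summary.items():
    print(label, stats)
```

To process a real results file, pass `open(path, newline="")` instead of the in-memory sample.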

Monitoring and logging

For comprehensive visibility and audit requirements, enable AWS CloudTrail logging, VPC Flow Logs, and service-specific logs (like Amazon S3 access logs). These logs can be centralized in Amazon CloudWatch Logs for monitoring and analysis. This provides proper audit trails while evaluating different SQL engines and helps track access patterns and potential security events.

Post-test steps

After running your JMeter tests, proceed with the following steps:

Review the HTML report’s key metrics, including response times, throughput, and error rates across different query types and concurrency levels.
Run identical test plans across your candidate SQL engines for direct performance comparison.
Refine your test plans based on initial findings, focusing on areas where performance differences are significant.
Factor in the cost implications alongside performance metrics to make a balanced decision.

These steps can help you systematically evaluate and select the most suitable SQL engine for your analytics workloads.

Resources

In the preceding steps, we walked through a UI-based setup for JMeter along with test plans. We have created a few sample JMeter test plans for both sequential and concurrent runs along with sample test reports. You can modify the plans to fit your needs.

JMeter sample report
JMeter test plan for sequential run
JMeter test plan for concurrent run

Clean up

After you complete your benchmarking process, clean up the resources to avoid unnecessary costs:

Stop or delete the EC2 instances used for running JMeter.
Depending on which SQL engines you used for testing, clean up active resources.
Review your AWS Management Console to confirm no active resources remain.
If you created test datasets in Amazon S3 or other storage services specifically for this benchmarking, consider deleting them if they’re no longer needed.
Although JMeter test plans and results don’t incur AWS costs, organize or delete local files as needed for your record-keeping.

Summary

Selecting the right SQL processing solution for large-scale analytics demands a systematic, data-driven approach. Our JMeter framework can help organizations effectively evaluate different SQL engines by simulating real-world workload patterns across various query types, data volumes, and concurrency levels. This methodology reduces proof-of-concept cycles and provides insights beyond traditional benchmarks, helping you assess managed AWS services like Athena and Amazon Redshift and open source solutions on Amazon EKS.

About the authors

Build petabyte-scale synthetic test data with Amazon EMR on EC2

Anubhav Awasthi — Tue, 19 May 2026 15:42:49 +0000

As you scale your data systems, you face a challenge: how to test thoroughly without putting customer data at risk. Using production data for testing can expose sensitive customer information to unauthorized access or breaches. For customers in regulated industries like finance and healthcare, this risk isn’t only a concern. It’s unacceptable. A data breach during testing could compromise their privacy, damage their trust, and expose organizations to significant compliance penalties. Synthetic test data solves this problem by generating artificial datasets that replicate the structure and patterns of real data without containing any actual customer information. This approach means you can test performance, validate data pipelines, and develop new features while ensuring that customer data remains protected and compliance requirements are met.

As data volumes grow from terabytes to petabytes, the architecture for generating synthetic data must evolve to meet increasing demands for scale, performance, and data quality. In this post, we show how you can build a scalable synthetic data generation solution using Amazon EMR, Apache Spark, and the Faker library.

The challenge of synthetic data generation

Traditional benchmark datasets like TPC-DS provide standardized schemas and predetermined data volumes for consistent testing environments across different systems. However, they fall short in meeting real-world testing requirements. These benchmarks don’t capture industry-specific patterns or the complex relationships found in actual production data. Their rigid schemas and simplified distributions fail to reflect business requirements, and scaling them while maintaining data consistency proves difficult. Perhaps most critically, generating massive datasets with traditional approaches requires specialized architectures to avoid proportional increases in compute costs and time.

Requirements for production-grade synthetic data

Effective workload validation demands synthetic data that mirrors production distributions while maintaining referential integrity across related tables and entities. The generation process must scale horizontally to accommodate growing data volumes while delivering deterministic results. Given identical input parameters, the system should produce the same dataset across multiple runs, supporting consistent testing cycles and comparative analysis.

Beyond technical requirements, synthetic data addresses compliance needs by minimizing exposure of personally identifiable information (PII) and protected health information (PHI) in non-production environments. This approach satisfies GDPR, HIPAA, and CCPA requirements while supporting secure cross-border data transfer, regular stress testing without compromising sensitive information, and providing an audit-friendly alternative to data masking that preserves analytical properties.

Solution overview

Architecting a synthetic data generation system that scales from terabytes to petabytes requires balancing several competing demands: the system must scale horizontally while maintaining data quality, generate large volumes efficiently, manage compute and storage resources cost-effectively, and support various schemas and output formats.

Our architecture addresses these challenges through four core components. Apache Spark on Amazon EMR provides the distributed computing framework necessary for large-scale generation. The Faker library offers synthetic data generation functions that integrate with Spark. Amazon Simple Storage Service (Amazon S3) with Apache Iceberg serves as the storage layer. We chose Iceberg for its schema and partition evolution capabilities without data rewrites, atomic transactions for consistency, precise time travel features for reproducible testing, and optimized performance at extreme scale. Amazon EMR handles dynamic resource allocation and cluster management.

The following diagram illustrates the solution architecture.

Synthetic data generation at scale with Amazon EMR

Amazon EMR emerges as a particularly powerful solution for this use case, offering several advantages that directly address our requirements. It facilitates scaling of compute resources through instance fleets and Spot Instances, which can reduce costs by up to 90% compared to On-Demand pricing. The service provides built-in performance optimization for Spark applications with real-time monitoring through Amazon CloudWatch integration.

The managed infrastructure reduces operational overhead by handling the underlying Spark ecosystem and cluster lifecycle, while still providing control over scaling policies, instance types, and configurations. Integration with Amazon S3, AWS Glue, and Amazon Athena facilitates end-to-end data generation and testing workflows. Support for multiple programming languages and notebooks provides flexibility in implementing generation logic tailored to specific testing scenarios.

The synthetic data generation process follows a systematic approach designed for efficiency and scalability, as illustrated in the following diagram.

Although synthetic data generation isn’t a sensitive workload, it’s important to maintain robust security throughout the data generation process. Amazon EMR provides security features that align with organizational compliance requirements.

For comprehensive security guidance specific to Amazon EMR deployments, refer to Security in Amazon EMR. The solution follows the AWS Shared Responsibility Model, where AWS manages the security of the cloud infrastructure, and customers maintain responsibility for data security, access management, and compliance controls in the cloud. Specifically for synthetic data generation workloads, AWS manages the security of the underlying Amazon EMR infrastructure, network, and service operations, and customers implement appropriate security controls for their data generation pipelines. Consider the following key areas:

Data protection – Enable encryption at rest and in transit using Amazon EMR security configurations, including Amazon S3 encryption and TLS certificates for inter-node communication.
Network security – Deploy Amazon EMR clusters in private subnets with security groups following least privilege, and enable the Amazon EMR block public access feature.
Access control – Implement AWS Identity and Access Management (IAM) roles with least privilege for Amazon EMR service roles, Amazon Elastic Compute Cloud (Amazon EC2) instance profiles, and runtime roles to isolate job access. Fine-grained table-level and column-level permissions can be controlled using AWS Lake Formation. Additional authentication options are available using Kerberos and LDAP.

Optimize Faker for petabyte-scale data generation

When generating synthetic data at petabyte scale, using Faker’s default implementations can quickly lead to performance bottlenecks. To overcome these limitations, adopt a combination of different optimization approaches instead of the default setup. Some of the approaches we adopted in this scenario are discussed in this section.

Faker instance pooling

The following code creates multiple Faker instances to avoid contention when generating data in parallel:

from faker import Faker

NUM_FAKER_INSTANCES = 10
faker_pool = [Faker() for _ in range(NUM_FAKER_INSTANCES)]

Consistent seed management

The following code provides reproducible data generation across distributed executors:

import random

random.seed(42)  # Seed Python's random module once for reproducible pool selection
for faker in faker_pool:
    faker.seed_instance(42)  # For reproducibility
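
When the same generation code runs on many partitions, one way to keep results reproducible while giving each partition its own random stream is to derive a seed from the base seed and the partition number. The following Python sketch shows the idea with the standard library only. The names `rng_for_partition` and `generate_ids` and the seed derivation are hypothetical and not taken from the solution code. The same derived seed could also be passed to a Faker instance through `seed_instance`.

```python
import random

BASE_SEED = 42  # same base seed value as in the preceding snippet

def rng_for_partition(partition_id, base_seed=BASE_SEED):
    """Return a generator whose stream depends only on the seed and partition."""
    # The multiplier is arbitrary; it only keeps derived seeds distinct
    return random.Random(base_seed * 1_000_003 + partition_id)

def generate_ids(partition_id, n=3):
    """Generate n pseudo-random IDs for one partition."""
    rng = rng_for_partition(partition_id)
    return [rng.randint(0, 10**9) for _ in range(n)]

first = generate_ids(0)
again = generate_ids(0)
other = generate_ids(1)
print("same partition, same output:", first == again)
print("different partition, same output:", first == other)
```

Because each generator is local to its partition, rerunning a single failed partition reproduces the same values without depending on the order in which other partitions ran.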

Random access to Faker pool

The following code distributes load across multiple Faker instances to reduce contention:

faker = faker_pool[random.randint(0, NUM_FAKER_INSTANCES-1)]

Broadcast variables for reference data

The following code efficiently distributes reference data to all executors:

tenant_ids_broadcast = spark.sparkContext.broadcast(tenant_ids)
protocols_bc = spark.sparkContext.broadcast(protocols)

Batch generation of synthetic data

The following code generates fake data in batches rather than one-by-one:

return (
    spark.range(1, num_endpoints + 1)
    .withColumn("hostname", random_hostname_udf())
)

ThreadPoolExecutor for parallel processing

The following code uses Python’s threading for parallel operations within executors:

from concurrent.futures import ThreadPoolExecutor

def parallel_write_with_sync(dataframe_configs, max_workers=3):
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Parallel processing (write logic elided in this excerpt)
        ...
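
The body of the function is not shown in the preceding excerpt. As a rough sketch of how such a function could be completed, the following Python example submits one task per configuration and waits for all of them. The `write_table` helper is a hypothetical stand-in for a real table write, and the table name `alerts` is made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def write_table(config):
    """Hypothetical stand-in for a real table write, such as a DataFrame write."""
    return f"wrote {config['name']}"

def parallel_write_with_sync(dataframe_configs, max_workers=3):
    """Run one write per config in a thread pool and wait for all of them."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {
            executor.submit(write_table, cfg): cfg["name"]
            for cfg in dataframe_configs
        }
        for future in as_completed(futures):
            # result() re-raises any exception raised in the worker thread
            results[futures[future]] = future.result()
    return results

configs = [{"name": "endpoints"}, {"name": "network_events"}, {"name": "alerts"}]
write_results = parallel_write_with_sync(configs)
print(write_results)
```

Threads suit this step because each write mostly waits on the cluster and storage layer rather than using driver CPU, so several writes can overlap.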

Optimize Amazon EMR and Spark

When processing massive datasets with Spark on Amazon EMR, carefully tuning configurations can substantially enhance performance beyond the standard settings. In this section, we discuss ways to optimize the execution environment, so you can efficiently handle petabyte-scale workloads with synthetic data generation. By strategically using Spark’s advanced features and configuring Amazon EMR for your specific use case, you can improve throughput, reduce processing time, and maximize resource utilization.

Arrow configuration

The following code enables Apache Arrow for efficient data transfer between Python and JVM. The default value is false.

.config("spark.sql.execution.arrow.pyspark.enabled", "true")

Enable this configuration when your PySpark application frequently converts data between Python and JVM, especially for large DataFrames or when using Pandas operations. Keep this setting disabled for pure Spark SQL workloads or when memory is constrained.

This optimization is most effective in the following scenarios:

When processing large-scale datasets that require frequent conversion between Python and JVM.
In a PySpark application where large DataFrame operations and Pandas integration are needed.
With data science workloads that combine Python UDFs with Spark SQL operations.

Consider the following trade-offs:

Arrow maintains in-memory columnar format, resulting in increased memory consumption.
Not all data types are fully supported in older versions of Spark.
It might introduce overhead for very small datasets where conversion costs outweigh the benefits.

Adaptive query execution

The following code allows Spark to dynamically optimize query execution plans. The default value is true in Spark 3.2 and later, and false in earlier versions.

.config("spark.sql.adaptive.enabled", "true")

We generally recommend keeping this optimization enabled for most workloads. Consider disabling it only when you have highly optimized, predictable queries where the adaptive overhead isn’t beneficial, or when troubleshooting query performance issues.

This optimization is most effective in the following scenarios:

Complex join operations with unknown or skewed data distributions.
Multi-stage queries where initial plans might be suboptimal.
When processing data with changing characteristics over time.

Consider the following trade-offs:

You may experience additional overhead during the query planning phase.
Spark might occasionally choose suboptimal plans for certain edge cases.

Parallelism configuration

The following code sets appropriate parallelism for distributed data processing based on the volume of data you’re generating. The default value for spark.default.parallelism is the total number of cores on all executor nodes or 2, whichever is larger. The default value for spark.sql.shuffle.partitions is 200.

.config("spark.default.parallelism", 1000)
.config("spark.sql.shuffle.partitions", 1000)

Adjust this configuration when the default of 200 shuffle partitions produces too few, oversized tasks (increase the value for larger data volumes) or too many small tasks (decrease it for smaller datasets). Generally, aim for partition sizes of 100–200 MB. Modify default.parallelism when your RDD operations need different parallelism than the CPU-based default.
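
As a rough starting point, the partition count can be estimated by dividing the expected data volume by a target partition size. The following Python sketch assumes a 128 MB target, which is simply a value inside the 100–200 MB range. The helper name `suggested_partitions` is hypothetical, and the result is an estimate to validate with test runs, not a tuned setting.

```python
import math

def suggested_partitions(total_bytes, target_partition_mb=128):
    """Estimate a partition count that keeps each partition near the target size."""
    target_bytes = target_partition_mb * 1024 * 1024
    # Always return at least one partition, even for empty inputs
    return max(1, math.ceil(total_bytes / target_bytes))

one_tib = 1024 ** 4
partitions_128 = suggested_partitions(one_tib)
partitions_200 = suggested_partitions(one_tib, target_partition_mb=200)
print(partitions_128)
print(partitions_200)
```

For 1 TiB of shuffle data, this gives 8,192 partitions at 128 MB and 5,243 at 200 MB, both far above the default of 200.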

This optimization is most effective in the following scenarios:

When generating consistent volumes of synthetic data across multiple runs.
When you have predictable resource requirements.
When you need to precisely control executor utilization.

Consider the following trade-offs:

Static configuration might not adapt well to varying data volumes.
Too many partitions can lead to task scheduling overhead.
Too few partitions might cause memory pressure on executors.

Memory management

The following code optimizes memory allocation for execution and storage. The default value for spark.memory.fraction is 0.6, and for spark.memory.storageFraction is 0.5.

.config("spark.memory.fraction", 0.8)
.config("spark.memory.storageFraction", 0.3)

Increase memory.fraction from 0.6 to 0.8 when your workload is memory-intensive and you’re not using the JVM heap for other purposes. Adjust storageFraction based on your caching vs. execution memory needs. Decrease to 0.3 if you do minimal caching but have complex computations, and increase to 0.7 or higher for cache-heavy workloads.
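
To see what these two settings mean in absolute terms, the following Python sketch approximates the unified memory regions for one executor. It relies on the formula from the Spark tuning documentation, in which the unified region is (JVM heap − 300 MiB) × spark.memory.fraction. The 16 GiB heap size and the helper name `unified_memory_mb` are assumptions for illustration.

```python
RESERVED_MB = 300  # memory Spark sets aside before applying spark.memory.fraction

def unified_memory_mb(executor_heap_mb, memory_fraction, storage_fraction):
    """Approximate Spark's unified memory regions for one executor heap."""
    unified = (executor_heap_mb - RESERVED_MB) * memory_fraction
    storage = unified * storage_fraction
    return {
        "unified_mb": round(unified),
        "storage_mb": round(storage),
        "execution_mb": round(unified - storage),
    }

heap_mb = 16 * 1024  # hypothetical 16 GiB executor heap
defaults = unified_memory_mb(heap_mb, 0.6, 0.5)
tuned = unified_memory_mb(heap_mb, 0.8, 0.3)
print("defaults:", defaults)
print("tuned:   ", tuned)
```

With these assumptions, the tuned values raise the unified region from about 9.4 GiB to about 12.6 GiB and shift most of it toward execution. The storage figure is the portion protected from eviction rather than a hard cap, because execution and storage can borrow unused memory from each other.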

This optimization is most effective in the following scenarios:

Workloads that are memory-intensive and need fine-grained control.
Workloads that balance between execution memory and cached data.
During synthetic data generation that has many interdependent fields.

Consider the following trade-offs:

Incorrect memory configuration can lead to frequent spills to disk or out-of-memory (OOM) errors.
You might need to change the configuration to suit different workload characteristics.
The settings must be monitored and tuned for optimal performance.

Limited Python UDF usage

The following code uses Spark’s built-in functions where possible instead of Python user-defined functions (UDFs). No additional configuration is needed. This is a coding practice.

# DecimalType(4, 2) leaves room for a value that rounds up to 10.00
.withColumn("risk_score", F.round(F.rand() * 9 + 1, 2).cast(DecimalType(4, 2)))

We recommend using Spark functions over Python UDFs when the same functionality can be achieved. Use Python UDFs only when complex business logic can’t be expressed using Spark’s built-in functions, or when integrating with specialized Python libraries.

This optimization is most effective in the following scenarios:

Simple transformations that can be performed using Spark functions.
High-throughput workloads where serialization overhead needs to be minimized.

Consider the following trade-offs:

This approach is less flexible compared to custom Python-based transformations or functions.
You might need to use complex expressions to accomplish certain data patterns.
There is a potential learning curve to familiarize yourself with Spark functions.

DataFrame caching

The following code caches frequently used DataFrames to avoid regenerating data. The default behavior doesn’t use caching. DataFrames are recomputed on each action.

endpoints_df = generate_endpoints().cache()

Use this optimization to cache DataFrames that are accessed multiple times in your application. Monitor memory usage and use MEMORY_AND_DISK storage level for large DataFrames. Uncache DataFrames when they’re no longer needed to free memory.

This optimization is most effective in the following scenarios:

When reusing reference data across multiple operations (can result in performance gains).
For workloads where the same data is processed on multiple occasions.

Consider the following trade-offs:

Too much caching might lead to memory pressure.
Planning is required to manage cache in environments where memory is scarce.

Optimal partitioning

By default, Spark determines partitioning based on input data and previous operations. The following code makes sure data is properly distributed across executors:

.repartition(20)

Use repartition() when you need to increase partitions for better parallelism or support even data distribution. Use coalesce() when reducing partitions to avoid small files. Generally, target 100–200 MB per partition for optimal performance.

This optimization is most effective in the following scenarios:

When controlling data distribution and avoiding data skew is very important.
Before executing an expensive operation that will benefit from balanced data distribution.
When optimizing downstream consumption use cases.

Consider the following trade-offs:

This option is more expensive than coalesce(). For large datasets, repartition() can trigger a large shuffle.
The approach requires trial and experimentation to determine the optimal partition count.
There is no “one-size-fits-all” setting. Different applications or operations might gain performance with different partitioning.

Partition-aware writing

By default, data is written without partitioning. The following code organizes data for efficient storage and retrieval:

{"df": network_events_df, "name": "network_events", "partition_cols": ["tenant_id"]}

Partition data when you have predictable query patterns that filter on specific columns. Choose partition columns that are frequently used in WHERE clauses and have reasonable cardinality (avoid too many small partitions or too few large ones).

This optimization offers the following benefits:

Allows for highly parallel write operation across multiple executors.
Organizes the data in a layout close to that of real-world production data.
Allows for partition pruning when querying the data.

Consider the following trade-offs:

Excess partitioning or too fine-grained partitioning might result in small files.
It might result in data skew because of hot partitions.
You might encounter storage and metadata overhead because of excessive partitions.

Best practices

Through our journey from terabytes to petabytes, we’ve identified several best practices:

Begin with a modest dataset and incrementally scale, allowing for identification of bottlenecks at each stage.
Implement robust data validation checks to confirm synthetic data maintains expected properties at scale.
Regularly review and adjust Amazon EMR configurations, using Spot Instances and right-sizing clusters.
Develop parameterized job scripts that can adjust data volume, complexity, and cluster resources dynamically.
Design your synthetic data schema and generation logic to quickly accommodate new fields or changing distributions over time.

Conclusion

Our journey from terabytes to petabytes of synthetic data generation demonstrates how Amazon EMR, combined with Spark and Faker, can effectively address large-scale testing needs. The architecture we explored in this post scales to meet demanding data generation requirements while maintaining data quality and cost-efficiency.

We showed how starting with a solid foundation at terabyte scale, then gradually expanding through Amazon EMR managed services and Spot Instances, helps organizations build robust synthetic data pipelines. The combination of efficient data generation techniques, proper validation, and continuous monitoring provides reliable results at scale.

To begin implementing your own synthetic data generation system, start small, test thoroughly, and scale incrementally. For implementation guidance, refer to Generate production-grade synthetic data at petabyte-scale using Apache Spark and Faker on Amazon EMR.

About the authors

Meet Amazon Redshift RG – AWS Graviton-based instances with an integrated data lake query engine delivering up to 2.4x better performance at 30% lower price than RA3

Ankit Sahu — Tue, 19 May 2026 15:38:08 +0000

On May 12, 2026, we announced the general availability of Amazon Redshift RG instances, powered by AWS Graviton processors. RG instances are up to 2.2x as fast for data warehouse workloads and up to 2.4x as fast for data lake workloads, all at 30% lower price per vCPU compared to RA3 instances. RG instances support all data lake formats supported by RA3 and eliminate Amazon Redshift Spectrum’s per-TB scanning charges. RG instances feature a custom-built integrated vectorized query engine, making them a more performant and cost-effective foundation for unified analytics.

We are launching with two instance sizes: rg.xlarge and rg.4xlarge, with additional sizes coming later this year.

Why we built this

RG instances bring the power of AWS Graviton processors to Amazon Redshift Provisioned clusters for the first time, paired with a purpose-built vectorized query engine. By combining Graviton’s superior price-performance with the latest Amazon Redshift innovations, RG instances deliver a step-change improvement across two dimensions: significantly lower cost and meaningfully faster performance for both warehouse and data lake workloads using Apache Iceberg and Apache Parquet. We built RG to help you avoid choosing between performance and economics. Graviton costs less to operate, and we’re passing that benefit to you while simultaneously raising the performance bar. Equally important, we designed RG to maintain full feature parity with RA3, so you can modernize your existing clusters without rearchitecting workloads or sacrificing capabilities you depend on today.

This combination is also increasingly critical for agentic artificial intelligence (AI) workloads. AI agents operating at scale generate a new class of analytics demand: high volumes of unique, unpredictable queries that require fast, low-latency responses to keep agents productive. Traditional price-performance ratios make running these workloads at scale cost-prohibitive. RG instances address this head-on. Lower per-vCPU pricing makes sustained high-query volumes economically viable, while improved query performance makes sure agents get answers fast enough to remain effective. Together, this provides the foundation for AI-driven analytics at the scale and economics that agentic workloads demand.

What’s new

RG instances: Better performance, lower cost

RG instances run on AWS Graviton, Amazon’s custom-designed cloud processor built from the ground up to deliver superior price-performance and energy efficiency. This translates directly into RG instances offering more compute cores, higher memory bandwidth, and lower inter-process communication latency compared to RA3, with performance improvements across warehouse, data lake, and mixed workloads.

Graviton costs less to operate, and we’re passing that benefit directly to you. RG instances are priced at a 30% lower cost per vCPU compared to RA3. Reserved Instance pricing follows the same model, making RG Reserved Instances equally 30% less costly than RA3. For pricing details, visit the Amazon Redshift pricing page.

Performance results

RG instances deliver faster, more efficient analytics across your most demanding warehouse and data lake workloads, whether you’re querying structured data in Amazon Redshift Managed Storage (RMS), running analytics over Iceberg tables in Amazon Simple Storage Service (Amazon S3), or processing Parquet files at scale. Iceberg workloads see the most significant gains, delivering up to 2.4x faster query execution. Parquet workloads deliver up to 1.5x faster query execution, and RMS-based data warehouse workloads deliver up to 2.2x faster query execution. All performance improvements are measured using industry-standard TPC-DS and TPC-H benchmarks at 10 TB scale on rg.4xlarge instances.

When combined with RG’s 30% lower per-vCPU pricing compared to RA3, these performance gains translate to even greater price-performance improvements, delivering more analytics value for every dollar spent.
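As a rough illustration of how the two figures combine, the following sketch divides each quoted speedup by the relative per-vCPU price. It assumes an RG cluster with the same vCPU count as the RA3 cluster it replaces, and that the "up to" speedups hold for your workload; neither is guaranteed. The function name is illustrative only.

```python
def price_performance_gain(speedup: float, price_reduction: float = 0.30) -> float:
    """Return work done per dollar on RG relative to RA3.

    Assumes equal vCPU counts, so relative cost is (1 - price_reduction).
    """
    if not 0 <= price_reduction < 1:
        raise ValueError("price_reduction must be in [0, 1)")
    relative_cost = 1.0 - price_reduction
    return speedup / relative_cost


# "Up to" speedups quoted in this post; real workloads will vary.
quoted_speedups = {"Iceberg": 2.4, "Parquet": 1.5, "RMS": 2.2}

for workload, speedup in quoted_speedups.items():
    gain = price_performance_gain(speedup)
    print(f"{workload}: up to {gain:.2f}x price-performance")
```

Treat the result as an upper bound for planning, and replace the quoted speedups with numbers measured on your own queries.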

Built-in data lake query engine – no more Spectrum charges

With RA3, data lake queries were offloaded to a separate fleet of nodes called Amazon Redshift Spectrum, scanning data externally and returning results back to the cluster. This architecture introduced network overhead, added latency, and imposed a $5/TB scanning charge on every query. RG instances change this fundamentally with a custom-built vectorized data lake engine running directly inside the cluster, eliminating Spectrum scanning charges.

The purpose-built vectorized engine includes a highly optimized scan layer that implements the latest data pruning techniques, a purpose-built I/O subsystem, and a range of optimizations that use Graviton’s processing capabilities to make scanning Iceberg and Parquet data highly efficient. Beyond raw scan performance, the engine introduces JIT ANALYZE, a capability that automatically collects and uses statistics for data lake tables during query execution. This eliminates the need for manual statistics collection. The system uses intelligent heuristics to identify queries that will benefit from statistics, maintains lightweight sketch data structures, and builds high-quality table-level and column-level statistics, all transparently. Having up-to-date statistics on data lake tables can deliver orders-of-magnitude improvements in query performance, and with JIT ANALYZE, you get this benefit automatically without operational overhead.

What customers are saying

Sean Lynch, Vice President, Data and Architecture, Southwest Airlines:

“Amazon Redshift RG instances have the potential to deliver meaningful business impact for Southwest Airlines. Based on initial testing in our development environment, our data warehouse workloads run 50-60% faster, and data lake analytics are 45% faster, enabling teams to get insights sooner, respond to operational conditions faster, and make data-driven decisions with less latency. These early results are encouraging, and we are excited to validate and scale these improvements in production. All of this comes without per-terabyte Spectrum scanning charges, delivering 30% lower cost than RA3 at a time when fuel prices continue to pressure industry margins.”

Akshay Srinivasan, Data Engineer, tombola:

“The new Graviton-based Amazon Redshift RG instances delivered 1.8x-2x faster write throughput and up to 2.2x faster read speeds compared to RA3 across a diverse set of batch and analytical jobs, enabling us to process 40% more within the same window. Compressed ETL cycles, accelerated time-to-insight, and decision-making no longer bottlenecked by the pipeline. Together, these translated directly into fresher data reaching our analysts and business teams sooner. What made this even more compelling was a concurrent 30% reduction in compute spend alongside the gains. Delivering more for less is a rare outcome, and one worth highlighting. In a volume-heavy gaming industry at tombola, where query latency and cost compound at scale, this has been one of the more impactful platform decisions we’ve made this year.”

Modernizing your workloads to RG

Today, we are launching rg.xlarge and rg.4xlarge instance sizes, available now for you to modernize your existing Amazon Redshift provisioned workloads. RG instances support three migration paths, all accessible directly from the AWS Management Console:

Elastic Resize (recommended): The fastest path for most customers migrating from RA3 or DC2, with only 10-15 minutes of downtime.
Snapshot & Restore: Best for you if you need to make configuration changes as part of your migration.
Classic Resize: Available for workloads that require a full cluster rebuild.

Before migrating your production workloads, we strongly recommend validating your queries and workloads on RG instances first. We’ve published an Upgrade Guide to help you right-size your cluster and plan your migration with confidence.

Getting started

You can start using the RG instances (rg.xlarge and rg.4xlarge) today in the following AWS Regions: US East (N. Virginia), US East (Ohio), US West (Oregon), US West (N. California), Canada (Central), South America (São Paulo), Europe (Ireland), Europe (Frankfurt), Europe (London), Europe (Paris), Europe (Stockholm), Europe (Milan), Europe (Spain), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Mumbai), Asia Pacific (Jakarta), Asia Pacific (Hong Kong), Asia Pacific (Osaka), Asia Pacific (Malaysia), Asia Pacific (Hyderabad), Asia Pacific (Taipei), and Asia Pacific (Melbourne).

You can launch new clusters or migrate existing clusters through the AWS Management Console, AWS Command Line Interface (AWS CLI), or AWS API.
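If you prefer to script the API path, the following is a minimal sketch using the AWS SDK for Python (boto3). The create_cluster and resize_cluster operations are existing Amazon Redshift client calls, but the cluster identifier, admin user name, and node count are placeholders, and the sketch assumes that your account and Region accept the rg.4xlarge node type through these calls. Verify against the Upgrade Guide before running it.

```python
from typing import Any, Dict


def create_params(cluster_id: str, node_type: str = "rg.4xlarge",
                  nodes: int = 2) -> Dict[str, Any]:
    """Build arguments for a multi-node Redshift create_cluster call."""
    if nodes < 2:
        raise ValueError("this sketch only covers multi-node clusters")
    return {
        "ClusterIdentifier": cluster_id,
        "ClusterType": "multi-node",
        "NodeType": node_type,
        "NumberOfNodes": nodes,
        "MasterUsername": "awsuser",    # placeholder admin user name
        "ManageMasterPassword": True,   # password is kept in AWS Secrets Manager
    }


def elastic_resize_params(cluster_id: str, node_type: str = "rg.4xlarge",
                          nodes: int = 2) -> Dict[str, Any]:
    """Build arguments for resize_cluster; Classic=False requests an elastic resize."""
    return {
        "ClusterIdentifier": cluster_id,
        "NodeType": node_type,
        "NumberOfNodes": nodes,
        "Classic": False,
    }


def run(operation: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Call the named Redshift operation; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the builders above work without the SDK

    client = boto3.client("redshift")
    return getattr(client, operation)(**params)


# Printing the parameters is safe; nothing is created until run() is called.
print(create_params("rg-demo-cluster"))
print(elastic_resize_params("existing-ra3-cluster"))
```

To submit a request, call run("create_cluster", create_params("rg-demo-cluster")) or run("resize_cluster", elastic_resize_params("existing-ra3-cluster")) from an environment with suitable permissions.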

To create a new RG cluster in the Amazon Redshift console

Review the Cluster and Nodes in the Amazon Redshift documentation.
Choose Amazon Redshift on the AWS Management Console and choose Create Cluster.
In the Create Cluster screen, choose the required RG node type.

To modernize from RA3 or DC2 in the Amazon Redshift console

Review the Upgrade Guide in the Amazon Redshift documentation.
Choose your migration path. Elastic Resize is the right starting point for most customers.
Choose the required RG node type.

For pricing details, visit the Amazon Redshift pricing page.

Clean up

If you are evaluating RG instances in a test or development environment and do not wish to continue, you can delete your RG cluster directly from the AWS Management Console or by using the AWS CLI to avoid incurring additional charges. If you used Snapshot & Restore to create a test RG cluster alongside your existing RA3 cluster, make sure you delete the RG cluster and any associated snapshots you no longer need. If you are using Data Sharing during migration, remember to remove data shares and decommission your RA3 cluster after you have fully validated your workloads on RG.

Conclusion

Amazon Redshift RG instances represent a significant step forward for you if you run data warehouse and data lake workloads on AWS. By bringing AWS Graviton processors to Amazon Redshift Provisioned clusters for the first time, paired with a purpose-built vectorized native data lake engine, RG instances deliver up to 2.4x better performance on Iceberg workloads, up to 1.5x on Parquet, and up to 2.2x on RMS data warehouse workloads, all at 30% lower per-vCPU cost than RA3. The elimination of Amazon Redshift Spectrum scanning charges makes data lake query costs predictable for the first time.

To get started with RG instances, visit the Amazon Redshift RG documentation to assess your workload and plan your migration.

Resources

Questions or feedback? Drop a comment or join the discussion on AWS re:Post.

OpenSearch Agent Skills bring built-in intelligence to your agentic IDE

Bobby Mohammed — Mon, 18 May 2026 19:15:11 +0000

Today, we’re launching OpenSearch Agent Skills, a repository of open, composable skills that bring built-in intelligence to developer workflows with OpenSearch, directly inside your favorite agentic IDE. By embedding OpenSearch expertise into the developer’s existing workflow, Agent Skills reduce setup time, eliminate unnecessary tool-hopping, and let teams focus on building rather than configuring.

Developers today can go from idea to working prototype in minutes using agentic IDEs like Claude, Cursor, and Kiro. They can spin up applications, generate APIs, and build end-to-end workflows with a prompt. But whether you’re experimenting with a new idea, building a POC, or running production systems, the experience quickly becomes more complex. For example, improving relevance in OpenSearch still requires deep expertise in query Domain-Specific Language (DSL), ranking logic, and hybrid search tuning. Troubleshooting latency or cluster health issues often means manually piecing together signals from logs, traces, shards, and infrastructure metrics. Even migrations from Elasticsearch or Solr can become complex and time-consuming because of schema conversion, compatibility gaps, and performance optimization challenges. As AI agents become a primary interface for building and operating applications on OpenSearch, a deeper gap emerges. Translating high-level intent into query DSLs, index configurations, and multi-step workflows still requires significant expertise. At the same time, workflows remain fragmented across domains like search, logs, and observability, forcing teams into siloed tooling and disconnected reasoning. The result is repeated trial-and-error, lack of standardized approaches, and slower time-to-value, despite the promise of faster development.

What are Agent Skills?

Agent Skills, developed by Anthropic, are a lightweight, open format for extending AI agent capabilities with specialized knowledge and workflows. They’re supported by a growing number of AI tools and agentic clients, including Kiro, Claude Code, Cursor, VS Code, GitHub Copilot, Codex and others.

At their core, Agent Skills are pre-built intelligence you can call, extend, and reuse. Each skill encapsulates domain knowledge, execution logic with multi-step workflows, and guidance with explainability, so you not only get results but understand how they’re achieved. Instead of stitching together tools and writing custom logic, you can invoke a skill to handle an entire task, from analysis to recommendation to execution.

At launch, OpenSearch Agent Skills introduces three foundational skills designed to address some of the most common and complex developer workflows: Search, Logs, and Solr to OpenSearch Migrations.

Search skill

The Search Skill builds on the foundation introduced by OpenSearch Launchpad, and brings an agentic, intent-driven experience to building and optimizing search applications with OpenSearch. Developers can go from a simple requirement or sample document to a fully working search application in minutes, whether lexical, semantic, hybrid, or agentic, with no deep OpenSearch expertise required.

What it does:

Translates natural language requirements or sample data into search configurations.
Automatically creates index mappings, ingest pipelines, and ML model integrations.
Sets up keyword, semantic, and hybrid search capabilities out of the box.

Example

Build a semantic search application for product documentation

Output:

Fully configured OpenSearch index with optimized mappings.
Integrated embedding models and ingest pipeline.
Working search experience (API + UI) ready to test and iterate.

Because the Search Skill extends the OpenSearch Launchpad capabilities into an agent-native workflow, you can move from idea to a production-ready search application in minutes, eliminating manual setup and accelerating both prototyping and deployment in OpenSearch.

Logs skill

The Logs Skill analyzes log data and investigates distributed traces directly within OpenSearch, bringing agentic intelligence to observability workflows. Instead of manually crafting PPL queries or piecing together trace data across services, developers can express their intent and let the skill handle the complexity.

What it does:

Queries and analyzes log data using PPL, including error patterns, log volume trends, and anomaly detection.
Investigates distributed traces, identifying slow spans, error spans, service dependencies, and agent invocations.
Correlates logs and traces using traceId to surface root causes across the full observability stack.

Example:

Investigate why my service is returning 500s and correlate with recent traces

Output:

PPL query results surfacing error patterns and log volume anomalies.
Trace analysis identifying slow or failing spans and service dependencies.
Correlated view linking log errors to specific trace IDs for faster root cause analysis.

With the Logs Skill, you can move from a vague symptom to a pinpointed root cause in minutes without needing to master PPL syntax or manually navigate trace data.
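For a sense of what the skill abstracts away, the following hand-written sketch builds a PPL request for the OpenSearch PPL endpoint (_plugins/_ppl). It is not output from the Logs Skill. The index name app_logs, the fields status and service, and the localhost URL are hypothetical, and a secured cluster would also need HTTPS and credentials.

```python
import json
import urllib.request


def build_ppl_request(base_url: str, ppl: str) -> urllib.request.Request:
    """Wrap a PPL query in a POST request for the _plugins/_ppl endpoint."""
    body = json.dumps({"query": ppl}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_plugins/_ppl",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Count server errors per service, busiest service first (hypothetical schema).
ERRORS_BY_SERVICE = (
    "source=app_logs "
    "| where status >= 500 "
    "| stats count() as errors by service "
    "| sort - errors"
)

request = build_ppl_request("http://localhost:9200", ERRORS_BY_SERVICE)
print(request.full_url)
print(request.data.decode("utf-8"))
# To send it against a running cluster: urllib.request.urlopen(request)
```

The request is only constructed here, not sent, so you can inspect the query before pointing it at a real cluster.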

Solr to OpenSearch migration skill

The Migration Skill streamlines the complex process of migrating from Solr to OpenSearch. Migrations typically involve cluster discovery, compatibility checks, schema translation, data movement, and validation. These steps often require deep expertise and manual coordination. The Migration Skill turns all these steps into a guided, automated workflow.

What it does:

Discovers and analyzes source clusters, including indices, mappings, and configurations.
Performs compatibility assessment and highlights breaking changes or required transformations.
Translates schemas, index settings, and queries into OpenSearch-compatible formats.

Example:

How can I migrate from Solr to OpenSearch?

Output:

Detailed migration plan with compatibility report and required changes.
Translated index mappings and configurations ready for OpenSearch.
Executed data migration pipeline with progress tracking.
Validation report confirming data integrity and query parity between source and target.

With the Migration Skill, developers can move from a fragmented, high-risk migration process to a structured, automated workflow. This approach provides faster transitions, reduced downtime, and confidence in production readiness.

How it works

OpenSearch Agent Skills are organized as a tree of SKILL.md files, structured by domain category. Rather than one monolithic skill that loads everything, the repo is broken into focused, independently installable skills. Each skill is small enough to stay within a tight context window, but complete enough to handle real end-to-end workflows.

The top-level structure currently groups skills into three categories:

Search: opensearch-launchpad for building BM25, semantic, and hybrid search applications from scratch.
Observability: log-analytics for PPL-based log querying and error analysis, and trace-analytics for distributed trace investigation and span analysis.
Cloud: aws-setup for deploying to Amazon OpenSearch Service (managed) or Amazon OpenSearch Serverless, with separate manifests for each.

Each skill bundles everything the agent needs: step-by-step workflows, reference docs (like PPL syntax guides and CLI references), and executable scripts that run directly against your cluster.

When you say “build a hybrid search app” or “why is my service throwing 500 errors?”, the agent activates only the matching skill, follows its instructions, and executes the right OpenSearch APIs. It returns results alongside clear explanations of what was configured and why. Because skills load on demand, you can have the full collection installed without bloating your agent’s context window.
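The on-demand loading described above can be pictured with a small sketch. It assumes each SKILL.md begins with a front matter block of simple key: value pairs (such as name and description) between --- markers, as in the Agent Skills format. The function names are hypothetical, and this is not how any particular agent implements skill activation.

```python
from pathlib import Path
from typing import Dict, List


def read_frontmatter(skill_file: Path) -> Dict[str, str]:
    """Parse simple `key: value` pairs between the leading '---' markers."""
    lines = skill_file.read_text(encoding="utf-8").splitlines()
    meta: Dict[str, str] = {}
    if not lines or lines[0].strip() != "---":
        return meta
    for line in lines[1:]:
        if line.strip() == "---":
            break
        key, sep, value = line.partition(":")
        if sep:
            meta[key.strip()] = value.strip()
    return meta


def index_skills(root: Path) -> List[Dict[str, str]]:
    """Collect only the metadata for every SKILL.md under root."""
    index = []
    for skill_file in sorted(root.rglob("SKILL.md")):
        meta = read_frontmatter(skill_file)
        meta["path"] = str(skill_file)
        index.append(meta)
    return index


def load_skill_body(entry: Dict[str, str]) -> str:
    """Read the full instructions only when a skill is activated."""
    return Path(entry["path"]).read_text(encoding="utf-8")


if __name__ == "__main__":
    # List whatever skills exist under the current directory.
    for entry in index_skills(Path(".")):
        print(entry.get("name", "(unnamed)"), "-", entry.get("description", ""))
```

Keeping only the short metadata resident and reading the full file on activation is what lets a large skill collection stay installed without crowding the context window.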

We’re continuously expanding the skill library. Categories like Dashboard and Migration are already on the roadmap, with more to come as the ecosystem grows.

Getting started

Getting started with OpenSearch Agent Skills is straightforward. No MCP server or extras are required. Skills are installed using npx skills and work directly with your existing agentic IDE.

Prerequisites:

Python 3.11+ and uv.
Docker installed and running.
AWS credentials configured (optional, for cloud deployment).

Install all skills:

npx skills add opensearch-project/opensearch-agent-skills

Or install a specific skill (for example, opensearch-launchpad):

npx skills add opensearch-project/opensearch-agent-skills@opensearch-launchpad --full-depth

npx skills add opensearch-project/opensearch-agent-skills@log-analytics --full-depth

npx skills add opensearch-project/opensearch-agent-skills@trace-analytics --full-depth

npx skills add opensearch-project/opensearch-agent-skills@migration-companion --full-depth

Once installed, simply express your intent to your agent, for example, “I want to build a semantic search app with OpenSearch,” and the agent reads the skill instructions and runs the scripts automatically.

Skills can also be installed to a specific agent (-a claude-code), globally across all projects (-g), or to all detected agents (--all). Explore available skills before installing with --list.

Looking ahead

This is just the beginning. We’re actively expanding the OpenSearch Agent Skills ecosystem with new capabilities across advanced relevance tuning, cost-aware performance optimization, index lifecycle and schema evolution, and cross-domain workflows that unify search, logs, and analytics.

Over time, we see Agent Skills becoming a community-driven knowledge layer across OpenSearch domains where solving a complex problem once means everyone benefits. More importantly, Agent Skills mark a fundamental shift in how developers build and operate with OpenSearch: moving away from manual, fragmented workflows toward intelligent, reusable capabilities that guide, optimize, and accelerate development at every stage.

Get involved

OpenSearch Agent Skills is designed to be an open, evolving ecosystem, and we’re just getting started. Here’s how you can participate:

Try it in your workflow. Install the skills in Claude, Cursor, or Kiro and start interacting with OpenSearch using natural language. Build new applications, investigate issues, or run migrations, and see how far intent-driven workflows can go.
Build and extend skills. Agent Skills are intentionally modular and extensible. Create your own skills to encode domain-specific workflows, internal best practices, or repeatable operational playbooks. Whether it’s a custom relevance tuning flow or a specialized observability pipeline, your contributions can become reusable intelligence for others.
Contribute to the ecosystem. We welcome contributions across all levels, from improving documentation and fixing bugs to adding entirely new skills. If you’ve solved a complex problem with OpenSearch, consider turning it into a skill and contributing it to the Git repo.
Share feedback and ideas. Let us know what worked, what didn’t, and what capabilities you’d like to see next, whether it’s deeper integrations, new domains, or more advanced automation.
Join the conversation. Engage with the OpenSearch community through GitHub discussions, community forums, and working groups. Collaborate with others building similar workflows and help define the future of agent-driven search and observability.

With OpenSearch Agent Skills, we’re moving toward a world where developers don’t only use tools but use shared intelligence. If that resonates with you, we’d love for you to be part of the journey.

Star and get involved in the OpenSearch Agent Skills repo. Join the conversation on the OpenSearch community forum and connect with us in the OpenSearch Slack channel.

Acknowledgments

We would like to extend our sincere gratitude to the following contributors for their valuable contributions to this project: Arjun kumar Giri, Sarat Vemulapalli, Chenyang Li, Fen Qin, Janelle Arita, Kaituo Li, Krishna Kondaka, Owais Kazi, Peter Zhu, and Zhichao Geng. Your dedication, expertise, and collaborative spirit have been instrumental in making this project successful. Thank you for your time and contributions.

How Smartsheet built Real-time Dynamic Filtering on Apache Flink reducing $40K/month in messaging costs

Emre Kartoglu — Mon, 18 May 2026 18:59:23 +0000

Processing hundreds of thousands of events per second while maintaining sub-second latency is a challenge many organizations face when building real-time data-driven applications. When filter policy changes propagate in up to 15 minutes, dynamic event routing becomes impractical, forcing teams to over-consume events and discard over 90% after costly per-event lookups. Smartsheet, a work management solution serving millions of users and processing hundreds of thousands of events per second to power features like live collaboration, workflows, and real-time notifications, faced exactly this problem.

In this post, you learn how Smartsheet built a Real-time Dynamic Filtering (RDF) system on Amazon Managed Service for Apache Flink, cutting messaging costs by over $40,000 per month and improving live collaboration latency by 1.8x.

The challenge: Static filter policies in a dynamic world

The Smartsheet event-driven architecture publishes hundreds of thousands of events per second to an Amazon Simple Notification Service (Amazon SNS) topic. Internal teams subscribe to this topic, typically by creating an Amazon Simple Queue Service (Amazon SQS) queue with an associated SNS filter policy defined through infrastructure as code (IaC). These filter policies are typically static and specify the types of events a consumer wants to receive, such as “sheet row created,” “sheet row updated,” or “sheet row deleted.”

Although SNS supports programmatic changes to filter policies, the SNS documentation notes that changes can take up to 15 minutes to take effect. This eventual consistency window created a significant problem for the Smartsheet live collaboration feature.

Live collaboration requires knowing, in real time, which sheets have active collaborators. When a user opens a sheet, the system needs to immediately start receiving events for that sheet. When they close it, the system should stop. With a 15-minute propagation delay on filter policy changes, dynamic per-sheet filtering through SNS was impractical.

The workaround was brute force: subscribe to all events (hundreds of thousands per second), pull them into an SQS queue, and use compute to check each event against Amazon DynamoDB to determine whether the sheet had active collaborators. Over 90% of events were discarded after this lookup.

Figure 1: Before RDF — all events flow through SNS to SQS, with per-event DynamoDB lookups to filter. Over 90% of events are discarded after processing.

Every event published to the SNS topic is delivered to the SQS queue, regardless of whether any consumer needs it.
The consumer AWS Lambda reads every message from the SQS queue and must evaluate each one individually.
For each event, the consumer queries DynamoDB to check whether the sheet has active collaborators. This per-event lookup adds latency and DynamoDB read costs on the hot path.
After the DynamoDB lookup, over 90% of events are found to have no active collaborators and are discarded.

This approach had three compounding cost and performance problems:

SNS-to-SQS data transfer costs: approximately $10,000 per month to deliver all events to the queue
SQS costs: approximately $30,000 per month to receive, process, and delete the full event volume
DynamoDB costs and latency: per-event lookups to check collaborator status added load to DynamoDB and increased end-to-end data delivery latency

The solution: Real-time Dynamic Filtering with Apache Flink

To solve this, Smartsheet built a system called Real-time Dynamic Filtering (RDF) on Amazon Managed Service for Apache Flink. The core insight was to move the filtering logic into the stream processing layer itself, using Flink’s KeyedCoProcessFunction, a feature that joins and processes multiple streams by a shared key, to maintain dynamic filter policies in Flink state (RocksDB).

How it works

The RDF Flink application reads from two streams:

Filter policy stream, sourced from Amazon DynamoDB Streams. When a team calls the RDF client to change their filter policy (for example, “start receiving events for sheet X”), the change is written to a DynamoDB table and propagated through DynamoDB Streams to the Flink application.
Data stream, the stream of sheet events (creates, updates, deletes) that were previously delivered through SNS.

One challenge remained: some consumers need every event, regardless of sheet. When a consumer subscribes to all events, the system needs every parallel Flink task to know about it. The team solved this using Flink’s broadcast state, which replicates a small set of “subscribe to everything” policies across all tasks. Because only a handful of consumers use this mode, the memory overhead stays negligible.

Figure 2: After RDF — consumer teams update filter policies via client libraries. DynamoDB Streams propagates changes to the Flink application, which filters the data stream in real time using keyed state (RocksDB) for specific sheet subscriptions and broadcast state for “all sheets” subscriptions.

When a consumer team wants to start or stop receiving events for a specific sheet, it calls the RDF client, a thin wrapper over the DynamoDB SDK. The filter policy change is written to that consumer’s dedicated DynamoDB table. Each consumer has its own table, providing isolated permissions and preventing noisy neighbor issues.
DynamoDB Streams captures every filter policy change as a change data capture (CDC) record and streams it to the Flink application in real time.
Filter policy records are routed in one of two ways:
1. Filter policy records for specific sheets are routed to the KeyedCoProcessFunction, keyed by SheetID. This makes sure that filter state and event data for the same sheet are co-located in the same Flink parallel task. State is stored in the RocksDB backend, which uses memory when available and spills to disk when necessary, so the system can scale without JVM heap constraints.
2. Filter policy records where a consumer has called listenToAllEvents() are broadcast to all parallel Flink tasks via Flink’s broadcast state. Because broadcast state lives in JVM heap, it is used exclusively for these “all sheets” records (of which there are very few), keeping the heap footprint small.
The full stream of CDC events flows into the KeyedCoProcessFunction, partitioned by SheetID. Each parallel task receives only the events for the sheets it is responsible for and applies the corresponding filter state to decide whether to forward or drop each event.
The broadcast state (containing “all sheets” subscriptions) is made available to all parallel instances of the KeyedCoProcessFunction, so that consumers subscribed to all events are never filtered out regardless of which task processes their events.
Only events that match an active filter policy are forwarded to the consumer’s SQS queue. The result: sub-second filter policy propagation (p95 ≤1s), elimination of per-event DynamoDB lookups, and over $40,000/month in cost savings.

Critically, because the filter policy state is persisted in Flink’s RocksDB state backend, the application does not need to perform a DynamoDB lookup for every event. Within 1 second of a filter policy change, the Flink application reads the change from the DynamoDB Streams source, updates its internal state, and begins filtering the data stream accordingly.
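The filtering decision can be modeled in a few lines of plain Python. This is a toy, single-process sketch and not Smartsheet's Flink code: a dictionary stands in for keyed state, a set stands in for broadcast state, and the class, method, and consumer names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple


class DynamicFilter:
    """Toy model of per-sheet keyed state plus "all sheets" broadcast state."""

    def __init__(self) -> None:
        # Keyed state: sheet ID -> consumers subscribed to that sheet.
        self.keyed: Dict[str, Set[str]] = defaultdict(set)
        # Broadcast state: consumers subscribed to every sheet.
        self.broadcast: Set[str] = set()

    def on_policy(self, consumer: str, sheet_id: Optional[str], active: bool) -> None:
        """Apply one filter policy change; sheet_id=None means all sheets."""
        if sheet_id is None:
            target = self.broadcast
        else:
            target = self.keyed[sheet_id]
        if active:
            target.add(consumer)
        else:
            target.discard(consumer)
        # Drop empty keyed entries so state does not grow without bound.
        if sheet_id is not None and not self.keyed[sheet_id]:
            del self.keyed[sheet_id]

    def on_event(self, sheet_id: str, payload: str) -> List[Tuple[str, str]]:
        """Return (consumer, payload) pairs to forward; an empty list means drop."""
        targets = self.keyed.get(sheet_id, set()) | self.broadcast
        return [(consumer, payload) for consumer in sorted(targets)]


rdf = DynamicFilter()
rdf.on_policy("live-collab", "sheet-1", active=True)
rdf.on_policy("audit", None, active=True)
print(rdf.on_event("sheet-1", "row updated"))
print(rdf.on_event("sheet-2", "row created"))
rdf.on_policy("live-collab", "sheet-1", active=False)
print(rdf.on_event("sheet-1", "row deleted"))
```

In the real application, partitioning by SheetID spreads the keyed state across parallel tasks, while the broadcast set is replicated to each of them. The sketch collapses both into one process to show only the forward-or-drop logic.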

Results

The impact of RDF was immediate and measurable across multiple dimensions:

Cost reduction

Cost category	Before RDF	After RDF	Monthly savings
SNS → SQS Data Transfer	~$10K/month	Eliminated	~$10K
SQS Event Ingestion	~$30K/month	~$2K	~$28K
DynamoDB Collaborator Lookups	Significant load	Eliminated (state in Flink)	Included in total
AWS Lambda	~$12K/month	~$5K/month	~$7K
Total			~$45K/month

Latency improvement

1.8x improvement in live collaboration data delivery latency. Users see changes from collaborators faster than before.
Filter policy propagation reduced from up to 15 minutes to a p95 of under 1 second.

If your architecture follows a similar fan-out pattern where consumers discard a large percentage of events after per-event lookups, you could achieve comparable cost reductions by moving filtering into the stream processing layer. The savings scale with your event volume and the percentage of events currently discarded.
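As a quick check of the arithmetic in the cost table, the following sketch totals the before and after figures. The dollar values are the approximate monthly amounts from the table, in thousands of USD. DynamoDB is left out because the post gives no dollar figure for it, so the total understates the full savings.

```python
# Approximate monthly costs in thousands of USD, taken from the cost table.
before = {"sns_to_sqs_transfer": 10, "sqs": 30, "lambda": 12}
after = {"sns_to_sqs_transfer": 0, "sqs": 2, "lambda": 5}

# Per-category and total savings.
savings = {category: before[category] - after[category] for category in before}
total_savings = sum(savings.values())

# Share of the original spend that was removed.
reduction = total_savings / sum(before.values())

print(savings)
print(f"Total: ~${total_savings}K/month ({reduction:.0%} of the listed spend)")
```

Substituting your own per-service costs and expected post-filtering volume gives a first-order estimate for a similar fan-out pipeline.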

Key design decisions

Several architectural choices were critical to the success of this solution:

Keyed state with selective broadcast: Specific sheet subscriptions are stored in keyed state using the RocksDB state backend. The system scales to a large number of filter policies without JVM heap constraints. Flink’s broadcast state is used only for the small number of “all sheets” subscriptions, where every parallel task needs visibility. Because broadcast state is stored in JVM heap, limiting its use to these few records keeps the heap footprint manageable.
DynamoDB Streams as the filter policy source: Rather than building a custom control plane, the team used DynamoDB Streams to propagate filter policy changes. DynamoDB Streams gave the team durability, ordering guarantees, and a native Flink source connector integration.
RocksDB state backend: Persisting filter state in RocksDB eliminated the need for external lookups on the hot path, keeping per-event processing latency low even as the number of active filter policies grows.
Client library abstraction: Publishing internal Golang and Java clients lowered the adoption barrier. The client is a thin abstraction on top of the DynamoDB SDK. Each consumer has its own dedicated DynamoDB table and corresponding filter stream, which provides two benefits: it allows fine-grained AWS Identity and Access Management (AWS IAM) permissions per client, and it mitigates the noisy neighbor problem by isolating each consumer’s filter policy traffic. Teams don’t need to understand Flink internals. They interact with a simple API to manage their subscriptions.
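To make the client library idea concrete, here is a hypothetical sketch of such a thin wrapper. Smartsheet's clients are written in Golang and Java and are not public, so everything here is an assumption made for illustration: the class name, the sheet_id key schema, and the "*" sentinel. The put_item and delete_item calls mirror the boto3 DynamoDB Table resource, and an in-memory stand-in keeps the example runnable without AWS access.

```python
from typing import Any, Dict, Protocol


class Table(Protocol):
    """The subset of the boto3 DynamoDB Table interface this sketch uses."""

    def put_item(self, *, Item: Dict[str, Any]) -> Any: ...
    def delete_item(self, *, Key: Dict[str, Any]) -> Any: ...


ALL_SHEETS = "*"  # hypothetical sentinel for "all sheets" subscriptions


class RdfClient:
    """Hypothetical wrapper that turns subscription calls into table writes."""

    def __init__(self, table: Table) -> None:
        self._table = table  # one dedicated table per consumer

    def listen_to_sheet(self, sheet_id: str) -> None:
        self._table.put_item(Item={"sheet_id": sheet_id})

    def stop_listening(self, sheet_id: str) -> None:
        self._table.delete_item(Key={"sheet_id": sheet_id})

    def listen_to_all_events(self) -> None:
        self._table.put_item(Item={"sheet_id": ALL_SHEETS})


class InMemoryTable:
    """Stand-in for a DynamoDB table, used only for this demonstration."""

    def __init__(self) -> None:
        self.items: Dict[str, Dict[str, Any]] = {}

    def put_item(self, *, Item: Dict[str, Any]) -> None:
        self.items[Item["sheet_id"]] = Item

    def delete_item(self, *, Key: Dict[str, Any]) -> None:
        self.items.pop(Key["sheet_id"], None)


table = InMemoryTable()
client = RdfClient(table)
client.listen_to_sheet("sheet-42")
client.listen_to_all_events()
client.stop_listening("sheet-42")
print(sorted(table.items))
```

In a design like this, each write would surface as a DynamoDB Streams record for the Flink application to consume, so callers never touch Flink directly.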

Next steps

The live collaboration team was the first adopter of RDF, but the architecture was designed as a shared platform. Smartsheet is now expanding RDF to additional internal teams, including workflow automation and notification routing, where similar fan-out patterns exist. The team is also exploring automatic scaling policies to optimize Flink cluster costs during off-peak hours.

Conclusion

Smartsheet’s Real-time Dynamic Filtering system demonstrates how Amazon Managed Service for Apache Flink can solve problems that go beyond stream processing. By combining Flink’s broadcast state pattern with KeyedCoProcessFunction, Smartsheet replaced a costly and latency-bound SNS/SQS fan-out architecture with a sub-second dynamic filtering platform. The result: over $40,000 per month in savings, 1.8x improvement in live collaboration latency, and a reusable platform that multiple teams are now adopting.

If you process high-volume event streams and need to dynamically control which events reach specific consumers, this pattern can help you reduce costs and latency, whether for live collaboration, workflow automation, notification routing, or multi-tenant event delivery.

Optimize Amazon S3 Tables queries with Amazon Redshift

Tom Romano — Thu, 14 May 2026 16:58:31 +0000

Amazon S3 Tables with Amazon Redshift gives you a powerful combination for analytical workloads on Apache Iceberg tables. But as query volumes grow, small inefficiencies compound. For example, repeated queries, such as dashboards refreshing hourly or analysts running the same joins throughout the day, scan data directly from Amazon Simple Storage Service (Amazon S3) every time. The fully qualified three-part table references (database@catalog.schema.table) add friction for business intelligence (BI) tools and end users who expect simpler SQL syntax. And without tuning the way S3 Tables organizes your data files, queries read more files than necessary. When you address these three areas, your S3 Tables queries in Amazon Redshift become faster, simpler, and more cost-efficient, whether you’re powering a recurring dashboard or supporting ad hoc analysis at scale.

This is the third post in our S3 Tables and Amazon Redshift series. The first post covered getting started with querying Apache Iceberg tables, and the second post walked through enterprise-scale governance and access controls. In this post, you address those performance and usability gaps with three approaches:

Create external schemas to simplify queries from three-part notation down to two-part notation.
Build materialized views that store pre-computed results locally so repeated queries skip the S3 scan.
Configure S3 Tables compaction strategies so the data file layout matches your query patterns.

The following diagram shows how these three approaches work together. External schemas [1] simplify query syntax through AWS Lake Formation resource links [2], materialized views [3] store pre-computed results locally in Amazon Redshift, and S3 Tables compaction [4] optimizes the underlying file layout for your query patterns.

Prerequisites

Before you begin, make sure you have:

An AWS account with permissions to manage AWS Identity and Access Management (IAM) roles, AWS Lake Formation, S3 Tables, and Redshift.
An Amazon Redshift Serverless workgroup or Amazon Redshift provisioned cluster (patch 188 or higher).
An S3 Table bucket with a namespace and tables created.
Lake Formation configured with the AWSServiceRoleForRedshift service-linked role as a read-only administrator.

If you haven’t completed these steps, follow the setup instructions in the first post in this series.

Simplify queries with external schemas

The previous posts in this series used the auto-mounted catalog to query S3 Tables with three-part notation:

SELECT * FROM redshifticeberg@s3tablescatalog.icebergsons3.examples;

You can use this syntax, but it can be cumbersome in business intelligence (BI) tools, in manually typed queries, and in application code. This syntax also requires the user to use IAM federation. By creating an external schema, you can reference the same tables with a concise two-part notation:

SELECT * FROM s3tables_schema.examples;

To set this up, you create a Lake Formation resource link that maps to your S3 Tables catalog, then create an external schema in Amazon Redshift that points to that resource link. Your setup differs slightly depending on whether your users authenticate through IAM federation or database credentials. While this doesn’t change query performance, it removes a common barrier to adoption by simplifying the reference.

Create a Lake Formation resource link

Both authentication methods require a resource link in Lake Formation that points to your S3 Tables database.

In the Lake Formation console, choose Databases under Data Catalog.
On the Create menu, choose Resource link.
Configure the resource link with the following settings:
- Resource link name: s3tables_rl
- Destination Catalog: Your account ID (for example, 111122223333)
- Shared Database: Your S3 Tables database (for example, icebergsons3)
- Shared Database’s Catalog ID: Your S3 Table bucket in the format 111122223333:s3tablescatalog/redshifticeberg

For more information, see Creating resource links in the Lake Formation documentation.
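If you prefer to script this step, the console settings above can be expressed as a request payload. The following is a minimal sketch, assuming the resource link is created through the AWS Glue CreateDatabase API with a TargetDatabase block; the helper name resource_link_input is hypothetical, and you should confirm the payload shape against the Lake Formation documentation before running the printed command.

```python
import json
import shlex


def resource_link_input(link_name: str, account_id: str,
                        table_bucket: str, namespace: str) -> dict:
    """Build a DatabaseInput payload for a resource link (hypothetical helper)."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit AWS account ID")
    return {
        "Name": link_name,
        "TargetDatabase": {
            # Same catalog ID format as the "Shared Database's Catalog ID" setting.
            "CatalogId": f"{account_id}:s3tablescatalog/{table_bucket}",
            "DatabaseName": namespace,
        },
    }


database_input = resource_link_input(
    "s3tables_rl", "111122223333", "redshifticeberg", "icebergsons3"
)

# Assumption: the equivalent AWS CLI call is `aws glue create-database`.
command = shlex.join(
    ["aws", "glue", "create-database", "--database-input", json.dumps(database_input)]
)
print(command)
```

The sketch only prints the command so that you can review it before running it in your own account.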

Option A: External schema for IAM federated users

If your users connect to Amazon Redshift through IAM federation, create the external schema with the SESSION keyword. This passes the federated user’s credentials through to Lake Formation for access control:

CREATE EXTERNAL SCHEMA s3tables_schema
FROM DATA CATALOG
DATABASE 's3tables_rl'
CATALOG_ID '111122223333'
IAM_ROLE 'SESSION'
CATALOG_ROLE 'SESSION';

Lake Formation evaluates permissions based on the federated user’s IAM role, so each user sees only the tables and columns that their role allows. This is the recommended approach for new deployments because it provides fine-grained access control without additional role management.

Option B: External schema for database users

External applications like Tableau, Power BI, and custom ETL tools often authenticate with database credentials instead of IAM federation. These users need an IAM role to access S3 Tables on their behalf.

Create an IAM service role to access S3 Tables:

You create a role (for example, S3TableAccessRole) with a trust policy that allows Amazon Redshift to assume it:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "redshift.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

You then attach the following permission policies to the role:

A policy for Lake Formation data access (substitute your 12-digit AWS Account ID for YOUR_ACCOUNT_ID):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "lakeformation:GetDataAccess",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "YOUR_ACCOUNT_ID"
                }
            }
        },
        {
            "Effect": "Deny",
            "Action": "lakeformation:PutDataLakeSettings",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "aws:ResourceAccount": "YOUR_ACCOUNT_ID"
                }
            }
        }
    ]
}

A policy for AWS Glue Data Catalog access (substitute the appropriate AWS Region for REGION_ID and your 12-digit AWS Account ID for YOUR_ACCOUNT_ID). For production, scope these permissions to your specific resources and AWS Region:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetTable",
                "glue:GetTables",
                "glue:GetTableVersion",
                "glue:GetTableVersions",
                "glue:GetTags"
            ],
            "Resource": [
                "arn:aws:glue:REGION_ID:YOUR_ACCOUNT_ID:catalog",
                "arn:aws:glue:REGION_ID:YOUR_ACCOUNT_ID:database/*",
                "arn:aws:glue:REGION_ID:YOUR_ACCOUNT_ID:table/*/*"
            ]
        }
    ]
}
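Substituting REGION_ID and YOUR_ACCOUNT_ID by hand is easy to get wrong, so you might generate the policy document instead. The following is a small sketch that fills in the AWS Glue Data Catalog policy shown above; the function name glue_catalog_policy is hypothetical, and the output is the same policy with the placeholders replaced.

```python
import json


def glue_catalog_policy(region: str, account_id: str) -> dict:
    """Return the Glue Data Catalog read policy with placeholders filled in."""
    if not (account_id.isdigit() and len(account_id) == 12):
        raise ValueError("account_id must be a 12-digit AWS account ID")
    prefix = f"arn:aws:glue:{region}:{account_id}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "glue:GetTable",
                    "glue:GetTables",
                    "glue:GetTableVersion",
                    "glue:GetTableVersions",
                    "glue:GetTags",
                ],
                # Wildcards match the example above; narrow them for production.
                "Resource": [
                    f"{prefix}:catalog",
                    f"{prefix}:database/*",
                    f"{prefix}:table/*/*",
                ],
            }
        ],
    }


policy = glue_catalog_policy("us-east-1", "111122223333")
print(json.dumps(policy, indent=4))
```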

Grant Lake Formation permissions to the role:

In the Lake Formation console, grant the S3TableAccessRole DESCRIBE access on the database and SELECT access on the tables for your resource link. For detailed steps, see Granting Lake Formation permissions.

Associate the role and create the schema:

First, associate the IAM role with your Amazon Redshift cluster or workgroup. For instructions, see Associating IAM roles with Amazon Redshift.

Create the external schema:

CREATE EXTERNAL SCHEMA s3tables_schema
FROM DATA CATALOG
DATABASE 's3tables_rl'
IAM_ROLE 'arn:aws:iam::111122223333:role/S3TableAccessRole';

Then grant access to your database users:

GRANT USAGE ON SCHEMA s3tables_schema TO my_database_user;
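The two options differ only in a few clauses, which the following sketch makes explicit. It renders the CREATE EXTERNAL SCHEMA statement for either option from the same inputs; the helper external_schema_sql is hypothetical and only builds the SQL text, so you still run the result in Amazon Redshift yourself.

```python
import re
from typing import Optional

_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def external_schema_sql(schema: str, resource_link: str,
                        catalog_id: Optional[str] = None,
                        role_arn: Optional[str] = None) -> str:
    """Render Option A (no role_arn, SESSION) or Option B (role_arn given)."""
    if not _IDENTIFIER.match(schema) or not _IDENTIFIER.match(resource_link):
        raise ValueError("schema and resource link names must be simple identifiers")
    lines = [
        f"CREATE EXTERNAL SCHEMA {schema}",
        "FROM DATA CATALOG",
        f"DATABASE '{resource_link}'",
    ]
    if role_arn is None:
        # Option A: pass the federated user's session through to Lake Formation.
        if catalog_id is None:
            raise ValueError("Option A needs the catalog (account) ID")
        lines += [f"CATALOG_ID '{catalog_id}'", "IAM_ROLE 'SESSION'", "CATALOG_ROLE 'SESSION'"]
    else:
        # Option B: database users access S3 Tables through a service role.
        lines.append(f"IAM_ROLE '{role_arn}'")
    return "\n".join(lines) + ";"


option_a = external_schema_sql("s3tables_schema", "s3tables_rl", catalog_id="111122223333")
option_b = external_schema_sql(
    "s3tables_schema", "s3tables_rl",
    role_arn="arn:aws:iam::111122223333:role/S3TableAccessRole",
)
print(option_a)
print(option_b)
```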

Query with two-part notation

With either option, you can now query S3 Tables using the simpler two-part notation:

SELECT * FROM s3tables_schema.examples LIMIT 10;

You can use this notation in BI tools, JDBC/ODBC connections, and application code and no longer need to know the underlying catalog structure.

Accelerate queries with materialized views

When you repeatedly query S3 Tables, each execution scans the external data from S3. Materialized views store pre-computed results in Amazon Redshift, so subsequent queries read from local storage instead of scanning S3 on every run.

Amazon Redshift supports incremental refresh for materialized views on Apache Iceberg tables, including after INSERT, DELETE, UPDATE, and table compaction operations on the base table. After the initial creation, each subsequent refresh processes only the rows that changed since the last refresh, rather than recomputing the full result set. This helps reduce both the time and compute cost of keeping your views current, especially for large tables with frequent changes.

Materialized views have general limitations and considerations when used with external data lake tables. For details, see Materialized views on external data lake tables.

Create a materialized view on S3 Tables

The following example creates a materialized view that joins the examples table in S3 Tables with a local categories table in Amazon Redshift. You can use a materialized view like this to pre-compute daily record counts and distinct ID counts per category:

CREATE MATERIALIZED VIEW mv_daily_category_summary
DISTSTYLE KEY
DISTKEY (category_id)
SORTKEY (insert_date)
AS
SELECT
    c.category_id,
    c.department,
    e.insert_date,
    COUNT(*) AS record_count,
    COUNT(DISTINCT e.id) AS unique_ids
FROM s3tables_schema.examples e
JOIN public.categories c
  ON c.category_id = e.category_id
GROUP BY c.category_id, c.department, e.insert_date;

Query the materialized view directly:

SELECT category_id, department, insert_date, record_count
FROM mv_daily_category_summary
ORDER BY record_count DESC
LIMIT 10;

Your query can now read from local Amazon Redshift storage and typically returns results without scanning S3 Tables.

Refresh strategies

You have two options for keeping materialized views current:

Automatic refresh: Set AUTO REFRESH YES in the view definition to have Amazon Redshift automatically refresh the view in the background when it detects changes to the base tables. This is a good fit for dashboards and reports that can tolerate a short delay between data changes and query results. Note that automatic refresh requires Option B (database user) when creating the external schema, and the default is AUTO REFRESH NO.

Manual refresh: Run REFRESH MATERIALIZED VIEW when you need to control the timing:

REFRESH MATERIALIZED VIEW mv_daily_category_summary;

Use manual refresh when you need to coordinate updates with data loading pipelines or when you want to refresh during off-peak hours.

Tune S3 Tables compaction for your query patterns

S3 Tables automatically compacts small Parquet files into larger ones in the background. This compaction reduces the number of read requests your query engine must make, which can improve query performance. By default, compaction targets a file size of 512 MB, configurable between 64 MB and 512 MB. Four compaction strategies are available, and choosing the right one for your query patterns can make a measurable difference.

Compaction strategies

Strategy	When to use	How it works
Auto	You want S3 to decide for you	Selects sort compaction for sorted tables, binpack for unsorted tables
Binpack	General-purpose workloads, unsorted tables	Combines small files into larger files (100 MB+) and applies pending row-level deletes
Sort	Queries frequently filter on a single column (e.g., `insert_date`)	Organizes data by the table’s sort-order columns during compaction
Z-order	Queries filter on two or more columns together (e.g., `insert_date` and `category_id`)	Blends multiple column values into a single scalar for sorting

Binpack improves performance by reducing the number of files a query engine reads. Sort compaction goes further. By ordering data within files, it enables query engines to skip entire files based on column min/max metadata during predicate pushdown. This is effective for queries that filter on the sort column, such as date-range filters. Z-order extends this benefit to queries that filter on multiple columns simultaneously, at the cost of slightly less efficient pruning on any single column compared to a pure sort.
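To see why ordering data within files helps, consider the following toy model. It is a simplified illustration of min/max file skipping, not how S3 Tables or Amazon Redshift implement it: each "file" records only the minimum and maximum value of one column, and a range filter has to read every file whose range overlaps the predicate. The names DataFile, write_files, and files_to_scan are made up for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DataFile:
    min_value: int
    max_value: int


def write_files(values: List[int], rows_per_file: int) -> List[DataFile]:
    """Split values into files in the given order and record per-file min/max."""
    files = []
    for start in range(0, len(values), rows_per_file):
        chunk = values[start:start + rows_per_file]
        files.append(DataFile(min(chunk), max(chunk)))
    return files


def files_to_scan(files: List[DataFile], low: int, high: int) -> int:
    """Count files whose [min, max] range overlaps the filter low <= x <= high."""
    return sum(1 for f in files if f.max_value >= low and f.min_value <= high)


# 1,000 rows with day numbers 0..99 that arrive interleaved (unsorted).
arrival_order = [i % 100 for i in range(1000)]
unsorted_files = write_files(arrival_order, rows_per_file=100)
sorted_files = write_files(sorted(arrival_order), rows_per_file=100)

unsorted_scans = files_to_scan(unsorted_files, 10, 19)
sorted_scans = files_to_scan(sorted_files, 10, 19)
print(unsorted_scans)  # 10: every file spans days 0..99, so none can be skipped
print(sorted_scans)    # 1: only the file that holds days 10..19 overlaps
```

Both layouts have the same number of files, which is what binpack alone gives you. The difference comes entirely from the ordering, which is the extra benefit of sort compaction for range filters.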

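To use sort or z-order compaction, you first need to define a sort order on the table with one column (sort) or multiple columns (z-order). WRITE ORDERED BY is an Apache Iceberg SQL extension, so run these statements from an engine that supports it, such as Apache Spark: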

-- Sort
ALTER TABLE icebergsons3.examples WRITE ORDERED BY insert_date;

-- Z-Order
ALTER TABLE icebergsons3.examples WRITE ORDERED BY insert_date, category_id;

Configure a compaction strategy

To change the compaction strategy for a table, use the PutTableMaintenanceConfiguration API through the AWS Command Line Interface (AWS CLI):

aws s3tables put-table-maintenance-configuration \
    --table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/redshifticeberg \
    --type icebergCompaction \
    --namespace icebergsons3 \
    --name examples \
    --value '{"status":"enabled","settings":{"icebergCompaction":{"strategy":"sort"}}}'

To adjust the target file size (for example, to 256 MB):

aws s3tables put-table-maintenance-configuration \
    --table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/redshifticeberg \
    --type icebergCompaction \
    --namespace icebergsons3 \
    --name examples \
    --value '{"status":"enabled","settings":{"icebergCompaction":{"targetFileSizeMB":256}}}'

Similar to the “sort” example, you can specify {"strategy":"z-order"} for z-order compaction.
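Because the --value argument is a JSON string, a typo in a setting name or an out-of-range file size is easy to miss. The following sketch builds the payload and checks it against the limits described earlier (four strategies, target file size between 64 MB and 512 MB) before printing the CLI command. The helper names compaction_value and compaction_command are hypothetical; the command and its flags are the ones used in the examples above.

```python
import json
import shlex
from typing import Optional

VALID_STRATEGIES = {"auto", "binpack", "sort", "z-order"}


def compaction_value(strategy: Optional[str] = None,
                     target_file_size_mb: Optional[int] = None) -> dict:
    """Build the icebergCompaction maintenance value, validating inputs."""
    settings = {}
    if strategy is not None:
        if strategy not in VALID_STRATEGIES:
            raise ValueError(f"strategy must be one of {sorted(VALID_STRATEGIES)}")
        settings["strategy"] = strategy
    if target_file_size_mb is not None:
        if not 64 <= target_file_size_mb <= 512:
            raise ValueError("targetFileSizeMB must be between 64 and 512")
        settings["targetFileSizeMB"] = target_file_size_mb
    return {"status": "enabled", "settings": {"icebergCompaction": settings}}


def compaction_command(table_bucket_arn: str, namespace: str, name: str, value: dict) -> str:
    """Render the AWS CLI call with the value safely quoted for a shell."""
    return shlex.join([
        "aws", "s3tables", "put-table-maintenance-configuration",
        "--table-bucket-arn", table_bucket_arn,
        "--type", "icebergCompaction",
        "--namespace", namespace,
        "--name", name,
        "--value", json.dumps(value, separators=(",", ":")),
    ])


z_order_value = compaction_value(strategy="z-order", target_file_size_mb=256)
print(compaction_command(
    "arn:aws:s3tables:us-east-1:111122223333:bucket/redshifticeberg",
    "icebergsons3", "examples", z_order_value,
))
```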

For more detail on sort and z-order, see Improve Apache Iceberg query performance in Amazon S3 with sort and z-order compaction.

Snapshot management

S3 Tables manages snapshots automatically. By default, it keeps a minimum of 1 snapshot and expires snapshots older than 120 hours (5 days). You can customize snapshot retention by setting minSnapshotsToKeep and maxSnapshotAgeHours. After a snapshot reaches the expiration time you configured in your retention settings, S3 Tables marks objects that only that snapshot references as noncurrent and removes them based on the unreferenced file removal policy.

You can adjust these settings if your workload needs more snapshots for time-travel queries or longer retention:

aws s3tables put-table-maintenance-configuration \
    --table-bucket-arn arn:aws:s3tables:us-east-1:111122223333:bucket/redshifticeberg \
    --namespace icebergsons3 \
    --name examples \
    --type icebergSnapshotManagement \
    --value '{"status":"enabled","settings":{"icebergSnapshotManagement":{"minSnapshotsToKeep":10,"maxSnapshotAgeHours":2500}}}'

Keep in mind that retaining more snapshots increases storage costs. If a materialized view references an expired snapshot, Amazon Redshift falls back to a full recompute on the next refresh. Therefore, snapshot retention can directly affect your materialized view refresh behavior. Balance snapshot retention with your materialized view refresh frequency to avoid unnecessary full recomputes.
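As a quick sanity check, you can compare your retention setting with how often each materialized view refreshes. The following sketch is a deliberate simplification: it only compares maxSnapshotAgeHours with the refresh interval and ignores minSnapshotsToKeep, and the function name retention_covers_refresh and the example intervals are made up for illustration.

```python
def retention_covers_refresh(max_snapshot_age_hours: int, refresh_interval_hours: int) -> bool:
    """True if a snapshot should still exist when the next refresh runs."""
    return max_snapshot_age_hours > refresh_interval_hours


DEFAULT_MAX_SNAPSHOT_AGE_HOURS = 120  # the 5-day default described above

for label, interval_hours in [("hourly", 1), ("daily", 24), ("weekly", 168)]:
    covered = retention_covers_refresh(DEFAULT_MAX_SNAPSHOT_AGE_HOURS, interval_hours)
    outcome = "incremental refresh possible" if covered else "expect a full recompute"
    print(f"{label} refresh with 120-hour retention: {outcome}")
```

With the default retention, a weekly refresh outlives the snapshot it last read, which is the situation where raising maxSnapshotAgeHours is worth the extra storage.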

For more information, see Maintenance for tables in the Amazon S3 documentation.

Best practices

Choose the right access pattern for your users. Use IAM federation with SESSION credentials for new applications and interactive users. Reserve the IAM role approach for BI tools and extract, transform, and load (ETL) pipelines that can’t integrate with IAM federation directly. Plan to migrate database users to federated access over time.

Match compaction strategy to query patterns. Use sort compaction when your queries filter on a single column (such as date ranges). Use z-order when queries filter on two or more columns together. Stick with the auto default if your query patterns vary or you’re unsure.

Size materialized views for your refresh window. Materialized views that join large external tables with local tables take longer to refresh. If your data changes frequently, keep the materialized view focused on the specific aggregations your dashboards need rather than materializing entire tables.

Coordinate snapshot retention with materialized view refresh. If a materialized view references an expired Iceberg snapshot, Amazon Redshift performs a full recompute instead of an incremental refresh. Set your snapshot retention (maxSnapshotAgeHours) longer than your materialized view refresh interval.

Monitor compaction with AWS CloudTrail. S3 Tables logs compaction operations as CloudTrail management events. Track these to verify that compaction runs on schedule and to identify tables that might benefit from a different strategy.

Balance performance gains against storage costs. Materialized views store pre-computed results in Amazon Redshift, adding to your managed storage. Compaction reduces file counts, but z-order and sort compaction can increase overall storage because of data duplication across sort boundaries. Review your Amazon Redshift managed storage usage and S3 Tables storage metrics periodically to make sure the performance benefits justify the additional storage utilization.

Troubleshooting

Issue	Resolution
“Permission denied” when creating the external schema	Verify the IAM role has `lakeformation:GetDataAccess` permission. Confirm you associated the role with your Amazon Redshift cluster or workgroup. Also check that you granted the role access to the resource link database and its tables in Lake Formation.
“Schema not found” or “Database not found” errors	Confirm the resource link name in Lake Formation matches the `DATABASE` value in your `CREATE EXTERNAL SCHEMA` statement. Verify the catalog ID format uses the pattern `account_id:s3tablescatalog/bucket_name`.
“Table not found” when querying through the external schema	Check that Lake Formation permissions include table-level access, not just database-level. Verify the table exists in the S3 Tables catalog by querying it through the auto-mounted catalog first.
Materialized view refresh falls back to full recompute	Check if the referenced Iceberg snapshot has expired. Increase `maxSnapshotAgeHours` in the snapshot management configuration. Verify that the base table hasn’t exceeded 4 million position deletes in a single data file. Compaction resolves this.
Queries on S3 Tables are slow after data loading	Compaction runs on an automated schedule and may not have processed recent writes yet. Check CloudTrail for the latest compaction event. Verify the compaction strategy matches your query patterns. Switch from binpack to sort if you filter on specific columns.

Cleaning up

To avoid ongoing costs, remove the resources you created in this walkthrough:

-- Drop materialized views
DROP MATERIALIZED VIEW IF EXISTS mv_daily_category_summary;

-- Drop external schemas
DROP SCHEMA IF EXISTS s3tables_schema;

Also remove:

The IAM role (S3TableAccessRole) and its attached policies, if you created one for database users.
The Lake Formation resource link and associated permissions.
The S3 table bucket, if you no longer need the data.

Conclusion

In this post, we showed how to optimize S3 Tables queries from Amazon Redshift using three approaches: external schemas that simplify query syntax from three-part to two-part notation, which makes it easier for BI tools and end users to work with S3 Tables; materialized views that store pre-computed analytical results and reduce repeated S3 scans; and S3 Tables compaction strategies tuned to your query patterns for more efficient file access.

For new applications, design your access layer with IAM federation and external schemas from the start. Use materialized views to accelerate repeated analytical queries that join S3 Tables with local Amazon Redshift data. Match your compaction strategy to how your team queries the data. Use sort compaction for date-range filters and z-order when queries filter on multiple columns at once. Furthermore, the same S3 tables you optimize here are also accessible from Amazon Athena, Amazon EMR, and third-party engines.

To learn more, see the Amazon S3 Tables documentation, Materialized views in Amazon Redshift, and S3 Tables maintenance. We welcome your feedback in the comments.


Securing client confidentiality at scale: Automated data discovery and governed analytics for legal workloads

Rohan Kamat — Wed, 13 May 2026 15:57:14 +0000

Automating data security and analytics for legal documents presents a unique challenge. Your legal team stores documents with strong access controls, organized by client and matter, encrypted at rest, and governed by well-defined policies. But what happens when you want to run analytics across those repositories? The typical path is extracting content into separate data pipelines or third-party tools, which fragments your governance model and introduces new risks. Law firms and corporate legal departments operate under distinct obligations that make data governance non-negotiable. Attorney-client privilege, work product doctrine, and professional conduct rules impose strict duties around how client information is handled, accessed, and disclosed. Governance failure in this context isn’t just a compliance gap; it can result in privilege waiver, disqualification from representation, or disciplinary action.

Legal professionals use ethical walls, also called information barriers, as structural safeguards that prevent the flow of confidential information between teams within a firm that represent adverse or potentially conflicting interests. Professional conduct rules mandate these barriers, and failure to maintain them can result in firm disqualification, malpractice liability, or regulatory sanctions.

Privilege boundaries are equally critical. Attorney-client privilege and work product protection apply only when you properly control access to the underlying material. If you expose privileged documents or metadata about their contents to unauthorized individuals, you risk losing your privilege protection. When organizations fail to maintain reasonable controls over privileged material, courts might find that they have waived their privilege. You should therefore actively manage your access governance, not only as a security concern but as a legal preservation requirement.

When you extract content into separate analytics systems or grant broader access than your matter structures support, you create pressure on both protections. You gain visibility but lose confidence in your controls.

In this post, we show you a reference architecture that automates sensitive data discovery across legal document repositories on Amazon Web Services (AWS), demonstrate how to capture structured findings as a compliance dataset, and guide you through building a governed analytics workspace that maintains your security boundaries. You walk away with a practical model for building security and analytics into the same lifecycle, without moving documents outside their system of record.

Analytics shouldn’t weaken governance

Most legal organizations have invested heavily in securing their document repositories. You store documents in structured storage, organized by client and matter. Your access controls map to matter boundaries (the organizational and access structures that separate one client engagement from another). You establish retention and hold policies.

The difficulty starts when teams want to analyze what’s inside those repositories. Running analytics typically means copying content into a separate system, standing up a new data pipeline, or granting broader access than existing matter structures support. Each of these steps introduces governance gaps. Manual reporting fills some of the void, but it doesn’t scale and can’t provide continuous visibility. What’s missing is a model where security controls and analytics reinforce each other, where the act of discovering sensitive data also produces the dataset that you use for reporting, and where governance applies once and carries through every downstream operation.

Automation addresses this by combining continuous sensitive data discovery with governed analytics, built on discovery metadata rather than document copies. This automated approach delivers four key advantages:

No document movement. Your files stay in their system of record. Analytics runs against structured discovery metadata, not document content, so governance boundaries remain intact.
Continuous discovery instead of manual scanning. Automated classification identifies regulated and sensitive information on an ongoing basis, replacing periodic manual reviews with on-demand visibility.
Unified governance. You define matter-aligned access policies once, and they carry through from document storage to findings analytics and compliance reporting.
Built-in audit readiness. A durable record of discovery findings and remediation actions accumulates automatically over time, giving you structured evidence for client reviews and regulatory inquiries.

Reference architecture

The following architecture shows how continuous discovery, governance, and compliance operations can work together without copying legal documents into analytics systems.

Architecture walkthrough

Store and protect documents in Amazon Simple Storage Service (Amazon S3)

Store your legal documents in Amazon S3, which serves as the system of record for document content. Align your buckets and prefixes to client and matter structures so that access controls map directly to matter boundaries. Where your retention or legal hold requirements demand it, apply S3 Object Lock to enforce immutability. You can encrypt your data using AWS Key Management Service (AWS KMS), which gives you centralized control over encryption keys and policies.

Discover and classify sensitive data with Amazon Macie

You will configure Amazon Macie to continuously analyze your document repositories. Macie identifies regulated information such as personally identifiable information (PII), financial data, and other sensitive content and produces structured findings that describe what Macie identified and where it exists. This provides ongoing visibility into data exposure without requiring document movement or manual scanning.

Catalog and govern findings with AWS Glue and AWS Lake Formation

You will use AWS Glue to catalog the findings dataset and maintain its schema so it stays query-ready. Apply AWS Lake Formation tag-based policies to govern access, aligning tags to client, matter, and confidentiality tier. This approach enforces ethical walls and least-privilege access consistently across analytics and reporting activities.
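To make the tag-based model concrete, the following toy sketch shows how a tag expression can enforce an ethical wall. It is a simplified illustration of tag matching, not the Lake Formation API: a grant lists the allowed values for each tag key, a resource carries one value per key, and access requires every key in the grant to match. The tag keys (client, matter, confidentiality), their values, and the function name are all hypothetical.

```python
from typing import Dict, Set


def tag_expression_matches(expression: Dict[str, Set[str]],
                           resource_tags: Dict[str, str]) -> bool:
    """True if the resource's value for every key in the expression is allowed."""
    return all(resource_tags.get(key) in allowed for key, allowed in expression.items())


# Hypothetical grant for an analyst who is staffed on matter M-100 only.
analyst_grant = {
    "client": {"acme"},
    "matter": {"M-100"},
    "confidentiality": {"internal", "restricted"},
}

findings_m100 = {"client": "acme", "matter": "M-100", "confidentiality": "restricted"}
findings_m200 = {"client": "acme", "matter": "M-200", "confidentiality": "internal"}

print(tag_expression_matches(analyst_grant, findings_m100))  # True: same matter
print(tag_expression_matches(analyst_grant, findings_m200))  # False: behind the wall
```

Because the decision depends only on tags, adding a new findings table for an existing matter needs no new grants; tagging the table is enough.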

AI-powered chat agent using Amazon Quick Suite

You can create custom chat agents to tailor conversational interfaces for specific legal business needs. These agents can be configured with legal-specific knowledge bases, connected to relevant document repositories, and customized with instructions appropriate for legal workflows. You can use this chat agent to interact with your legal documents through natural language conversation for capabilities like:

E-Discovery: Search and analyze large volumes of legal documents to quickly find relevant information across your document repository.
Contract Analysis: Review contracts and automatically extract key terms, clauses, and obligations to streamline your contract review process.

The chat agent can help you navigate complex document sets through conversational queries, making legal research and document review more efficient and accessible.

Analyze and report with Amazon Quick Sight

You will use Amazon Quick as your compliance operations workspace. Quick provides a unified environment where your teams can query findings, generate dashboards, track remediation actions, and produce audit-ready reports. The agentic AI capabilities of Amazon Quick can autonomously build analyses, surface anomalies across matters, generate executive summaries for client reviews, and proactively recommend remediation priorities based on finding severity and trends. Combined with built-in data stories for automated narrative generation and pixel-perfect paginated reports for regulatory submissions, Quick reduces the time from discovery to action while keeping your teams within a governed interface aligned to matter-based permissions. Rather than switching between separate visualization, workflow, and reporting tools, your legal and compliance teams can review findings, manage response activities, and collaborate all within a single workspace that respects ethical walls and privilege boundaries.

Escalate high-severity findings

For high-severity findings that demand immediate attention, route alerts through AWS Security Hub or Amazon Simple Notification Service (Amazon SNS) to trigger escalation workflows. This connects visibility directly to action when your teams identify sensitive data risks.
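The selection logic behind such an escalation rule can be small. The following sketch filters findings by severity and builds an alert subject line; it assumes findings arrive as dictionaries with a severity.description field and a resourcesAffected.s3Bucket.name field, and the sample findings, bucket names, and function names are invented for illustration. Publishing the alert to Amazon SNS or AWS Security Hub is left out.

```python
from typing import Dict, List

ESCALATION_SEVERITIES = {"High"}


def findings_to_escalate(findings: List[Dict]) -> List[Dict]:
    """Keep only findings whose severity description calls for escalation."""
    return [
        f for f in findings
        if f.get("severity", {}).get("description") in ESCALATION_SEVERITIES
    ]


def alert_subject(finding: Dict) -> str:
    """Build a short subject line for the escalation message."""
    bucket = finding["resourcesAffected"]["s3Bucket"]["name"]
    return f"[{finding['severity']['description']}] {finding['type']} in {bucket}"


# Illustrative findings only; real findings carry many more fields.
sample_findings = [
    {"type": "SensitiveData:S3Object/Personal",
     "severity": {"description": "High"},
     "resourcesAffected": {"s3Bucket": {"name": "legal-docs-client-a"}}},
    {"type": "SensitiveData:S3Object/Financial",
     "severity": {"description": "Low"},
     "resourcesAffected": {"s3Bucket": {"name": "legal-docs-client-b"}}},
]

escalations = findings_to_escalate(sample_findings)
for finding in escalations:
    print(alert_subject(finding))
```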

Why this approach works for legal

Documents stay where they belong. Your files remain in Amazon S3, aligned to client and matter boundaries. No content moves into separate analytics pipelines.
Ethical walls remain intact. Because analytics is built on discovery findings and not document copies, you can govern access to findings using the same matter-aligned controls that apply to documents. Compliance and security teams gain visibility without expanding document access.
Discovery runs continuously, not periodically. Rather than scheduling quarterly or annual scans, you maintain a current view of sensitive data across your repositories.
Governance applies once and carries through. Lake Formation tag-based policies govern findings access at the catalog level. You define your matter and confidentiality mappings once, and they carry through to every dashboard, query, and report.
Audit readiness is built in. Instead of assembling reports manually before a client review or regulatory inquiry, you maintain a historical record of discovery findings and remediation actions. You can demonstrate your posture over time with consistent, structured evidence.
Security and analytics reinforce each other. Your analytics capability is built on top of your security controls, not alongside them. Strengthening one strengthens the other.

Cost considerations

The primary cost drivers for this architecture include:

Amazon Macie: You pay based on the number of S3 buckets evaluated and the volume of data inspected for sensitive data discovery. Review Amazon Macie pricing for current rates.
Amazon S3: Storage costs for both your document repositories and the compliance intelligence bucket. Consider S3 lifecycle policies to tier older findings into lower-cost storage classes.
AWS Glue and AWS Lake Formation: Charges for crawlers and catalog storage. For most implementations, these costs are modest.
Amazon Quick Sight: Per-user pricing based on the edition that you select (Standard or Enterprise). Enterprise edition supports row-level and column-level security, which aligns well with matter-based governance.
Amazon EventBridge, AWS Security Hub, and Amazon SNS: Charges based on event volume and notifications delivered. For findings-based workflows, these costs are generally low.

Use the AWS Pricing Calculator to estimate costs based on your repository size, user count, and discovery frequency.

Getting started

Start by identifying a representative set of document repositories in Amazon S3. We recommend that you start with two or three matters that span different practice areas and confidentiality tiers.

Turn on Amazon Macie for those repositories and configure automated sensitive data discovery.
Catalog the findings dataset with AWS Glue and apply Lake Formation tag-based access policies aligned to your matter structure.
Build your first Amazon Quick Sight dashboard to visualize findings by matter, sensitivity type, and severity.
Define escalation rules in AWS Security Hub or Amazon SNS for high-severity findings.

After you validate this workflow against your initial repositories, expand gradually. Add more repositories to Macie discovery. Refine your governance tags to reflect practice areas and confidentiality tiers. Extend your dashboards from basic posture visibility to trend analysis and remediation tracking.

The goal isn’t to build a comprehensive analytics solution all at once. Start with a secure foundation where discovery findings, governance, and reporting operate together in a way that aligns with your legal workflows, and then expand from there.

Conclusion

You don’t have to choose between protecting client data and understanding it. By building analytics on top of governed discovery findings and using a unified compliance workspace, you gain visibility into your data posture without weakening confidentiality boundaries.

This approach brings security, governance, and analytics together in a way that reflects how legal work is actually structured. It provides continuous visibility, supports audit readiness, and delivers insight without requiring documents to move outside their system of record.

Next steps

Review the Amazon Macie User Guide to understand sensitive data discovery configuration options, and the Amazon Quick Sight documentation to evaluate dashboard and row-level security capabilities.

Contact your AWS account team to discuss implementation support for legal and compliance workloads.


Streamlined monitoring and debugging for Amazon EMR on EC2

Parul Saxena — Tue, 12 May 2026 15:59:22 +0000

As organizations scale their data processing and analytics workloads on Amazon EMR on EC2, observability across cluster health, job execution, and resource usage becomes increasingly important. Teams often manage log collection across distributed nodes, correlate Amazon EMR steps with underlying YARN applications, and configure monitoring agents to capture the right level of detail for their environment.

With Amazon EMR release 7.11.0 and updates to the Amazon EMR console, Amazon EMR on EC2 introduces observability capabilities that streamline these workflows further. In this post, we walk you through five key enhancements: Amazon CloudWatch Logs integration, step-level Amazon Simple Storage Service (Amazon S3) logging controls, expanded console UIs for YARN and Tez, Amazon EMR step to YARN application ID mapping, and enhanced custom metrics with updated documentation.

What’s new

The following sections cover key improvements across the Amazon EMR console, logging, metrics collection, and documentation to give you deeper, end-to-end visibility into your Amazon EMR clusters and workloads.

1. CloudWatch Logs integration

Starting with Amazon EMR release 7.11.0, you can stream cluster logs to Amazon CloudWatch Logs in near real time without requiring custom bootstrap actions or manual agent configuration. With Amazon CloudWatch logging enabled, Amazon EMR automatically captures and streams Amazon EMR step execution logs, Spark driver logs, and Spark executor logs as they’re generated. This makes them immediately available for monitoring, troubleshooting, and post-mortem analysis through the CloudWatch console or API.

You can enable CloudWatch logging through the Amazon EMR console during cluster creation, or programmatically using the AWS Command Line Interface (AWS CLI) or an SDK. In the programmatic case, include the Amazon CloudWatch Agent in your application configuration and specify your logging preferences in the monitoring configuration.

With minimal configuration, Amazon EMR captures step logs and Spark driver logs by default, streaming them to a log group named /aws/emr/{cluster_id}. For production workloads requiring stricter organizational and security controls, you can customize the log group name, define a log stream prefix for streamlined filtering, enable encryption with an AWS Key Management Service (AWS KMS) key, and explicitly select which log types to capture. The following example demonstrates a fully customized configuration:

aws emr create-cluster \
  --name "EMR cluster with custom CloudWatch Logs" \
  --release-label emr-7.11.0 \
  --applications Name=Spark Name=AmazonCloudWatchAgent \
  --instance-type m7g.2xlarge \
  --instance-count 3 \
  --use-default-roles \
  --monitoring-configuration '{
    "CloudWatchLogConfiguration": {
      "Enabled": true,
      "LogGroupName": "/my-company/emr/production",
      "LogStreamNamePrefix": "cluster-prod",
      "EncryptionKeyArn": "arn:aws:kms:us-east-1:123456789012:key/12345678-1234-1234-1234-123456789012",
      "LogTypes": {
        "STEP_LOGS": ["STDOUT", "STDERR"],
        "SPARK_DRIVER": ["STDOUT", "STDERR"],
        "SPARK_EXECUTOR": ["STDERR", "STDOUT"]
      }
    }
  }'

This configuration directs the logs to a custom log group (/my-company/emr/production), prefixes log stream names with cluster-prod for consistent identification across clusters, encrypts log data at rest using the specified KMS key, and captures the full set of available log types: step stdout/stderr, Spark driver, and Spark executor output. Because logs are streamed to CloudWatch as they’re written, you have near real-time visibility into job execution without waiting for log aggregation to S3 or establishing direct connectivity to cluster nodes. Combined with CloudWatch Logs Insights, you can run structured querying across log streams, making it straightforward to trace failures, correlate errors across driver and executor logs, and build metric filters or alarms based on specific log patterns.
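
If you generate cluster definitions from code, it can help to build the monitoring document programmatically and validate it before passing it to --monitoring-configuration. The following Python sketch assumes that the field names and log types shown in the preceding example are the complete set. The function name and the default of capturing only step logs are choices made for this illustration.

```python
import json

# Log types and streams that appear in the example above.
ALLOWED_LOG_TYPES = {"STEP_LOGS", "SPARK_DRIVER", "SPARK_EXECUTOR"}
ALLOWED_STREAMS = {"STDOUT", "STDERR"}


def build_monitoring_configuration(log_group, stream_prefix,
                                   kms_key_arn=None, log_types=None):
    """Build the document passed to --monitoring-configuration."""
    # Illustrative default: capture only step logs when nothing is specified.
    log_types = log_types or {"STEP_LOGS": ["STDOUT", "STDERR"]}
    for log_type, streams in log_types.items():
        if log_type not in ALLOWED_LOG_TYPES:
            raise ValueError(f"Unknown log type: {log_type}")
        unknown = set(streams) - ALLOWED_STREAMS
        if unknown:
            raise ValueError(f"Unknown streams for {log_type}: {sorted(unknown)}")
    config = {
        "Enabled": True,
        "LogGroupName": log_group,
        "LogStreamNamePrefix": stream_prefix,
        "LogTypes": log_types,
    }
    if kms_key_arn:
        # Encryption is optional; include the key only when one is supplied.
        config["EncryptionKeyArn"] = kms_key_arn
    return {"CloudWatchLogConfiguration": config}


monitoring_json = json.dumps(
    build_monitoring_configuration(
        "/my-company/emr/production",
        "cluster-prod",
        log_types={"STEP_LOGS": ["STDOUT", "STDERR"], "SPARK_DRIVER": ["STDERR"]},
    ),
    indent=2,
)
```

The resulting string can be passed as the value of --monitoring-configuration in place of the inline JSON.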

2. Step-level S3 logging improvements

S3 logging capabilities now provide granular control over how step logs are organized and secured. You can now specify a dedicated S3 log destination and AWS KMS encryption key at the individual Amazon EMR step level. This allows different steps within the same cluster to write logs to separate S3 paths with independent encryption configurations. This is particularly useful for multi-tenant clusters or workflows with varying data classification requirements.

Step-level logging is configured through the StepMonitoringConfiguration parameter, which accepts an S3MonitoringConfiguration object where you can define the target S3 path and an AWS KMS key for encryption at rest:

"StepMonitoringConfiguration": { "S3MonitoringConfiguration": { "LogUri": "s3://your-s3-bucket/", "EncryptionKeyArn": "arn:aws:kms:your-kms-key-arn" } }

This configuration is optional. When omitted, the step inherits the default S3 log path and encryption settings defined at the cluster level during creation. With this configuration, you can override logging behavior only for the steps that require it, while maintaining a consistent default for the rest of your workflow.
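
The inheritance behavior can be sketched in Python as follows. The StepMonitoringConfiguration fragment comes from the preceding example. Attaching it at the top level of the step definition, next to the standard Name, ActionOnFailure, and HadoopJarStep fields, is an assumption for this illustration, so check the API reference for the exact placement.

```python
def step_monitoring_override(log_uri=None, kms_key_arn=None):
    """Return the per-step logging fragment, or {} to inherit cluster defaults."""
    if log_uri is None:
        return {}
    s3_config = {"LogUri": log_uri}
    if kms_key_arn:
        s3_config["EncryptionKeyArn"] = kms_key_arn
    return {"StepMonitoringConfiguration": {"S3MonitoringConfiguration": s3_config}}


def build_step(name, args, log_uri=None, kms_key_arn=None):
    """Build a step definition; placement of the override is an assumption."""
    step = {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {"Jar": "command-runner.jar", "Args": list(args)},
    }
    step.update(step_monitoring_override(log_uri, kms_key_arn))
    return step


# This step inherits the cluster-level log path and encryption settings.
default_step = build_step("nightly-etl", ["spark-submit", "job.py"])

# This step writes its logs to a dedicated path with its own key.
sensitive_step = build_step(
    "pii-etl",
    ["spark-submit", "pii_job.py"],
    log_uri="s3://your-s3-bucket/pii/",
    kms_key_arn="arn:aws:kms:your-kms-key-arn",
)
```

Only the second step carries an override, which mirrors the pattern of keeping one default and overriding it where data classification requires.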

3. Enhanced console with direct access to monitoring UIs

Additional live application UIs are accessible directly from the Amazon EMR console. These console-hosted interfaces remove the need to configure SSH (Secure Shell) tunnels, set up proxies, or establish any direct network connectivity to cluster nodes to reach application web UIs. The newly added interfaces include:

YARN ResourceManager UI – Monitor cluster-wide resource allocation, queue usage, and application lifecycle states across running and completed YARN applications. This interface also provides direct access to container-level logs for running YARN applications, enabling real-time debugging without requiring node-level access.
Tez UI – Inspect Hive query execution plans, DAG visualizations, vertex-level performance metrics, and task-level counters for queries executed through the Tez execution engine (for example, Hive and Pig workloads).

These join the existing Spark History Server and YARN timeline interfaces already available through the console. By surfacing these UIs, administrators can grant developers and analysts visibility into cluster workloads and application diagnostics without exposing direct network access to cluster infrastructure. This keeps security boundaries tight while preserving full observability into job execution and resource consumption.

With these additions, Amazon EMR now offers three complementary approaches to accessing application web interfaces, each suited to different operational requirements. Live Application UIs provide console-hosted access to web interfaces on running clusters. They’re recommended for environments where direct network connectivity to cluster nodes must be restricted from end users. On-Cluster Web UIs offer full, unrestricted access to the complete set of native application web interfaces running on cluster nodes, suited for administrators and engineers who require deep, low-level visibility. Persistent Web UIs retain application-level data beyond cluster lifetime, so you can analyze and troubleshoot workloads on terminated clusters. Together, these options give you the flexibility to balance security boundaries, access scope, and data retention based on your team’s specific monitoring and debugging workflows.

4. EMR step to YARN application ID mapping

The Amazon EMR console now surfaces the YARN Application ID directly within the EMR step details panel. For each step executing a Spark, Hive, or other YARN-based workload, the console displays the submitted YARN Application ID associated with that step, establishing a direct link between the EMR step abstraction and the underlying YARN application. With this mapping, you can:

Directly correlate EMR steps to YARN applications – when a step fails or exhibits unexpected behavior, you can immediately identify the exact YARN application to investigate rather than manually cross-referencing timestamps or job names across interfaces.
Access live monitoring tools – with the YARN application ID readily available, you can navigate directly to the YARN ResourceManager Live UI or the Spark History Server to inspect resource consumption, task-level execution details, and application state for both running and completed jobs.
Retrieve logs for detailed troubleshooting – the application ID serves as the key lookup for retrieving container-level logs persisted to Amazon S3, significantly reducing the time to root-cause failures or diagnose performance regressions.

To use this feature, open the Steps tab on your Amazon EMR cluster detail page and select the step that you want to investigate. The YARN Application ID appears in the step details panel. From there, you can use the ID to navigate to the YARN ResourceManager Live UI at http://resourcemanager-host:8088/cluster/app/<application_id>, open the corresponding view in the Spark History Server, or locate the associated container logs in your configured S3 log destination.
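
If you script this lookup, a small helper can build the ResourceManager link from the application ID shown in the step details. The following Python sketch uses the URL layout quoted above. The function name is hypothetical, and the ID check assumes the usual YARN form application_<cluster timestamp>_<sequence number>.

```python
import re

# Usual YARN application ID shape: application_<cluster timestamp>_<sequence>.
APP_ID_PATTERN = re.compile(r"^application_\d+_\d+$")


def resource_manager_app_url(rm_host, application_id, port=8088):
    """Build the ResourceManager UI link for a YARN application."""
    if not APP_ID_PATTERN.match(application_id):
        raise ValueError(f"Not a YARN application ID: {application_id!r}")
    return f"http://{rm_host}:{port}/cluster/app/{application_id}"


app_url = resource_manager_app_url(
    "resourcemanager-host", "application_1715500000000_0001"
)
```

Validating the ID first catches the common mistake of pasting an Amazon EMR step ID where the YARN application ID is expected.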

5. Enhanced custom metrics and observability documentation

By default, Amazon EMR automatically sends cluster-level metrics to Amazon CloudWatch at five-minute intervals, covering YARN application states, node health, HDFS utilization, and I/O activity. With Amazon EMR Release 7.0 and later, enabling the Amazon CloudWatch Agent extends this baseline with additional detailed metrics collected at one-minute intervals across cluster nodes. Furthermore, Amazon EMR 7.1 introduced custom metric classifications that you can use to define precisely which component-level metrics to collect from Hadoop, YARN, and HBase subsystems, like DataNode I/O activity, NodeManager JVM heap utilization, container resource consumption, and HBase performance counters. Each classification supports configurable export intervals, giving you control over collection granularity based on your monitoring requirements.

After you enable custom metrics, they’re accessible directly from the Monitoring tab in the Amazon EMR console, where you can use a classification filter to switch between the HDFS, YARN, and HBase custom metric groupings that you’ve defined. Metric configurations can also be updated on running clusters through the console’s reconfiguration workflow, so you can adapt your monitoring strategy as workload requirements evolve without cluster downtime. For environments using Prometheus, metrics can also be forwarded to Amazon Managed Service for Prometheus and visualized through Grafana dashboards.

The following documentation and tutorials are available to help you get the most out of these capabilities:

Enhanced Custom Metrics Guide provides step-by-step instructions for configuring CloudWatch Agent to publish custom metrics.
EMR Observability Best Practices provides a comprehensive guide covering monitoring strategies, metric selection, and troubleshooting workflows.
Service Status Monitoring provides a tutorial on monitoring and publishing Amazon EMR application status.
Monitor Apache Spark applications on Amazon EMR with Amazon CloudWatch provides a tutorial to publish detailed Spark metrics to CloudWatch and identify performance bottlenecks in Spark applications.

Getting started

These observability improvements are available now for Amazon EMR on EC2. To get started:

For CloudWatch Logs integration and step-level log configuration: Launch a new cluster with Amazon EMR release 7.11.0 or later.
For console enhancements: Navigate to your existing Amazon EMR clusters in the AWS Console to access Live Application UI links and YARN Application ID mappings in step details, with no additional configuration required.
For custom metrics: Review our Enhanced Custom Metrics documentation to configure the CloudWatch Agent for publishing Hadoop, YARN, and HBase component metrics using custom classification files.

Conclusion

With these enhancements, Amazon EMR on EC2 provides deeper visibility into cluster health, job execution, and resource usage, helping you reduce time to root cause and focus on delivering value from your data. Note that enabling CloudWatch Logs integration and custom metrics incurs additional CloudWatch charges based on log ingestion volume and metric publishing frequency.

If you have feedback or questions, reach out to your AWS account team or post on AWS re:Post.

Detect and resolve HBase inconsistencies faster with AI on Amazon EMR

Yu-Ting Su — Tue, 12 May 2026 15:56:41 +0000

HBase operations teams spend hours manually correlating logs, metadata, and consistency reports to identify root causes. Traditional approaches require deep expertise and extensive investigation across scattered data sources, directly impacting MTTR and operational efficiency. As HBase deployments scale and expertise becomes increasingly scarce, organizations face mounting pressure to maintain service reliability while managing growing operational complexity. The manual nature of troubleshooting creates bottlenecks that delay incident resolution, increase operational costs, and risk service degradation during critical business periods.

In this post, we show you how to build an AI-powered troubleshooting solution using Amazon OpenSearch Service vector search and intelligent analysis. This solution reduces HBase inconsistency resolution from hours to minutes and root cause identification from days to hours through natural language queries over operational data. This democratizes HBase troubleshooting capabilities across teams and reduces dependency on specialized expertise.

Solution overview

The solution addresses HBase troubleshooting challenges through data processing, vector search, and AI-powered analysis. It processes operational data from Amazon EMR clusters, generates semantic vector embeddings, and enables natural language queries for intelligent troubleshooting.
Key components include:

Amazon EMR HBase: Runs HBase workloads with Amazon S3 as the HBase rootdir for durable, scalable storage
Data Processing: Extracts and processes HBase logs, HBCK reports, and metadata with vector embeddings
Amazon OpenSearch Service: Provides vector search capabilities with k-NN algorithms for semantic analysis
AI Analysis Interface: Enables natural language queries with context-aware recommendations
Custom Knowledge Base: Supports organization-specific runbooks and troubleshooting procedures by ingesting Git repositories via Kiro CLI’s /knowledge add command, enabling the AI assistant to reference custom operational guides alongside HBase source code and operational tools

The preceding diagram illustrates how the HBase log analysis system troubleshoots inconsistencies through automated workflows across AWS services.

When an operations team needs to investigate HBase issues, the engineer connects over SSH to the Amazon EMR primary node and runs the error collection script, which gathers logs from HBase master and RegionServer nodes and uploads them to Amazon S3. Next, the engineer connects to the Analytics Amazon Elastic Compute Cloud (Amazon EC2) instance and executes the automated processing script, which downloads logs from Amazon S3, generates semantic vector embeddings, and ingests them into Amazon OpenSearch Service for k-NN-based semantic search. The engineer then queries the Kiro CLI AI Assistant using natural language to investigate. Kiro searches Amazon OpenSearch Service for relevant log entries and uses Amazon Bedrock to analyze patterns, correlate errors across components, and provide actionable recommendations. This reduces troubleshooting time from hours to minutes. The system operates within an Amazon Virtual Private Cloud (Amazon VPC) with private subnets for Amazon EMR and Analytics Amazon EC2, AWS Identity and Access Management (AWS IAM) roles for access control, AWS Systems Manager Parameter Store for configuration, and Amazon CloudWatch for monitoring.

Prerequisites

For this walkthrough, you need the following prerequisites:

AWS account setup

An AWS account with administrative access for initial deployment
AWS Command Line Interface (AWS CLI) configured with administrative credentials

Required AWS IAM permissions

For infrastructure deployment

Your deployment user or role needs the following permissions:

Your deployment user or role requires sufficient access to AWS CloudFormation, Amazon S3, AWS IAM, and AWS Systems Manager.
The user or role must have the ability to create AWS CloudFormation stacks.

Infrastructure deployment:

For infrastructure deployment, you need AWS CloudFormation stack management permissions.
You also require sufficient access to create and manage the following resources:
- Amazon OpenSearch Service domains
- Amazon EC2 instances, Amazon VPCs, security groups, and networking components
- AWS IAM roles and policies
- AWS Systems Manager Parameter Store entries
- Amazon CloudWatch Logs groups
- Amazon S3 bucket for access logs and session logs

Runtime service roles

The AWS CloudFormation stack automatically creates two specialized AWS IAM roles designed with least-privilege access principles.

The first role is the Amazon OpenSearch Service Role, which manages Amazon VPC networking and Amazon CloudWatch logging for the Amazon OpenSearch Service domain.

The second role is the Application Role, which provides minimal Amazon OpenSearch Service and Amazon S3 access specifically for log processing applications and secure log ingestion operations.

Network requirements

Amazon VPC with private subnets for secure Amazon OpenSearch Service deployment
NAT Gateway for outbound internet access from private subnets
Security groups configured for HTTPS-only communication

Running Kiro CLI on Amazon EC2

Kiro platform requirements:

Kiro subscription

Active Kiro License: Valid subscription to Kiro platform
User Account: Registered Kiro user account with appropriate permissions
API Access: Kiro API keys or authentication tokens for CLI access

AWS IAM Identity Center integration

AWS IAM Identity Center Setup: AWS IAM Identity Center enabled in your AWS organization
Permission Sets: Configured permission sets for Kiro users with appropriate AWS access
User Assignment: Users assigned to relevant AWS accounts and permission sets
SAML/OIDC Configuration: Identity provider integration if using external identity systems

Additional prerequisites

Python 3.7+ and Node.js installed locally
Python 3.11+ for AWS Lambda runtime environment (required for OpenSearch MCP server compatibility)
Sufficient service quotas for Amazon OpenSearch Service instances and Amazon EC2 resources
Access to the analysis instance through AWS Systems Manager Session Manager (recommended)
Amazon EMR clusters running HBase workloads
The Amazon EMR EC2 instance profile role (EMR_EC2_DefaultRole by default) must be able to run describe-stacks on AWS CloudFormation stacks in us-east-1
Basic familiarity with HBase operations

The deployment follows AWS security best practices with resource-specific permissions, regional restrictions, and encrypted data storage. All AWS IAM policies implement least-privilege access patterns to help secure operation of the log analysis pipeline.

Walkthrough

This walkthrough demonstrates deploying and configuring the AI-powered HBase troubleshooting solution in five key steps:

Deploy AWS infrastructure using AWS CloudFormation
Connect to the Amazon EC2 instance and set up the system (optionally adding a custom knowledge base)
Configure Amazon EMR log analysis collection
Process and index HBase data
Enable AI-powered analysis

The complete solution is available in our GitHub repository.

Step 1: Deploy the infrastructure

Deploy the required AWS infrastructure including Amazon OpenSearch Service domain, Amazon EC2 instances, and AWS IAM roles.

To deploy the infrastructure

Deploy the AWS CloudFormation stack. Replace your-email@example.com with the email address that should receive security alerts and Advanced Intrusion Detection Environment (AIDE) reports:

# Deploy to development environment
aws cloudformation create-stack \
  --stack-name dev-hbase-log-analysis \
  --template-body file://cloudformation/hbase-log-analysis-simple.yaml \
  --parameters \
    ParameterKey=EnvironmentName,ParameterValue=dev \
    ParameterKey=EC2InstanceType,ParameterValue=m7g.xlarge \
    ParameterKey=SecurityAlertEmail,ParameterValue=your-email@example.com \
  --capabilities CAPABILITY_IAM \
  --region us-east-1
# Wait for deployment to complete (~15-20 minutes)
aws cloudformation wait stack-create-complete \
  --stack-name dev-hbase-log-analysis \
  --region us-east-1

Note the deployment outputs including Amazon OpenSearch Service endpoint and Amazon EC2 instance details in the AWS CloudFormation console.

The deployment creates:

Amazon OpenSearch Service domain with vector search capabilities
Amazon EC2 instance for data processing and AI analysis
AWS IAM roles with appropriate permissions
Security groups and Amazon VPC configuration
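
If you drive the deployment from a script, keeping the stack parameters in one mapping avoids retyping them for each environment. The following Python sketch renders the parameters from the create-stack command in this step in two shapes: the ParameterKey=...,ParameterValue=... strings that the AWS CLI accepts, and the list of key/value records that SDK calls typically take. The helper names are hypothetical.

```python
def to_cli_arguments(values):
    """Render parameters as AWS CLI ParameterKey=...,ParameterValue=... strings."""
    return [f"ParameterKey={key},ParameterValue={value}" for key, value in values.items()]


def to_sdk_parameters(values):
    """Render parameters as a list of ParameterKey/ParameterValue records."""
    return [{"ParameterKey": key, "ParameterValue": str(value)}
            for key, value in values.items()]


# Values taken from the create-stack command in this step.
stack_parameters = {
    "EnvironmentName": "dev",
    "EC2InstanceType": "m7g.xlarge",
    "SecurityAlertEmail": "your-email@example.com",
}

cli_arguments = to_cli_arguments(stack_parameters)
sdk_parameters = to_sdk_parameters(stack_parameters)
```

Switching environments then only requires a different mapping, not a different command.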

Step 2: Connect to Amazon EC2 instance and set up system

Connect to the Amazon EC2 instance using AWS Systems Manager (SSM) and set up the required components.

To connect and set up the system

Run the following commands to get the instance ID from AWS CloudFormation outputs and connect via AWS Systems Manager (SSM):

# Get instance ID
INSTANCE_ID=$(aws cloudformation describe-stacks \
  --stack-name dev-hbase-log-analysis \
  --query 'Stacks[0].Outputs[?OutputKey==`EC2InstanceId`].OutputValue' \
  --output text \
  --region us-east-1)
# Connect via SSM
aws ssm start-session --target $INSTANCE_ID --region us-east-1
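
The --query expression above selects one entry from the stack’s Outputs list. If you work with the describe-stacks response in Python instead, the same lookup can be sketched as follows. The response layout (Stacks, Outputs, OutputKey, OutputValue) matches what describe-stacks returns, while the sample values and the helper name are made up for this illustration.

```python
def get_stack_output(response, output_key, stack_index=0):
    """Return the OutputValue for output_key from a describe-stacks response."""
    outputs = response["Stacks"][stack_index].get("Outputs", [])
    for output in outputs:
        if output["OutputKey"] == output_key:
            return output["OutputValue"]
    raise KeyError(f"Stack output not found: {output_key}")


# Sample response with placeholder values, trimmed to the relevant fields.
sample_response = {
    "Stacks": [
        {
            "StackName": "dev-hbase-log-analysis",
            "Outputs": [
                {"OutputKey": "EC2InstanceId", "OutputValue": "i-0123456789abcdef0"},
            ],
        }
    ]
}

instance_id = get_stack_output(sample_response, "EC2InstanceId")
```

Raising an error for a missing key is a deliberate choice here, so that a typo in the output name fails loudly instead of producing an empty instance ID.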

Clone the repository and run automated setup:

# On EC2 instance
sudo su - ec2-user

# Re-install aws cli
sudo dnf remove awscli -y

# For ARM64 (Graviton instances - default)
curl "https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip" -o "awscliv2.zip"

# For x86_64 (if using non-Graviton instances)
# curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"

unzip awscliv2.zip
sudo ./aws/install

# update $PATH in ~/.bashrc
echo 'export PATH=$PATH:/usr/local/bin/' >> ~/.bashrc

# Reload ~/.bashrc
source ~/.bashrc

# Fork and clone the source code repository on GitHub: sample-emr-hbase-inconsistencies-detection-recovery-mcp-kiro
git clone https://github.com/YOUR_USERNAME/sample-emr-hbase-inconsistencies-detection-recovery-mcp-kiro.git hbase-analysis
cd hbase-analysis

# Run automated setup
chmod +x ./scripts/setup/automated-system-setup.sh
./scripts/setup/automated-system-setup.sh \
  --emr-version emr-7.12.0 \
  --stack-name dev-hbase-log-analysis \
  --region us-east-1

The automated setup script installs:

System dependencies (awscli, git, unzip)
uv package manager and OpenSearch MCP Server
Kiro CLI, configured with AWS IAM Identity Center authentication. The script automatically adds the Apache HBase open source repository and the Apache HBase open source operational tools repository to the knowledge bases
HBase source repositories for your Amazon EMR version
Python dependencies and MCP server configuration

Add your own knowledge base to Kiro CLI

To extend Kiro CLI’s analysis capabilities beyond the Apache HBase open source repositories, you can add your organization’s HBase runbooks and troubleshooting guides as your own knowledge base repositories. Periodically validate and maintain your runbook contents so that they remain accurate and up to date, reflecting any changes in your HBase environment, configurations, or operational procedures. Use the following commands:

# Navigate to the HBase repositories directory
cd /opt/hbase-repositories
# Clone your organization's HBase runbook repository
git clone <runbook-repository-url> <your-own-runbook-repo>
# Example:
# git clone https://github.com/your-org/hbase-runbooks.git hbase-runbooks
# git clone https://gitlab.company.com/ops/hbase-troubleshooting.git hbase-troubleshooting
# Add your custom repositories to Kiro CLI knowledge base manually (run these commands inside kiro-cli):
echo "/knowledge add --name \"Your custom HBase knowledge base\" --path /opt/hbase-repositories/<your-own-runbook-repo>" | kiro-cli
# Example:
# echo "/knowledge add --name \"Company HBase runbooks\" --path /opt/hbase-repositories/hbase-runbooks" | kiro-cli
# echo "/knowledge add --name \"HBase troubleshooting guides\" --path /opt/hbase-repositories/hbase-troubleshooting" | kiro-cli

Step 3: Configure Amazon EMR log analysis collection

Set up data collection from your Amazon EMR clusters to gather HBase logs, metadata, and consistency reports using the recommended direct collection method.

To configure Amazon EMR log analysis collection

On your Amazon EMR cluster primary node, run the following commands to download the collection scripts:

# On EMR primary node
sudo su - hadoop

# Fork and clone the source code repository on GitHub: sample-emr-hbase-inconsistencies-detection-recovery-mcp-kiro
git clone https://github.com/YOUR_USERNAME/sample-emr-hbase-inconsistencies-detection-recovery-mcp-kiro.git hbase-analysis
cd hbase-analysis

Run the interactive collection wizard:

# Run collection wizard
python3 scripts/utilities/emr_log_collection/emr_cluster_wizard_v2.py

Enter parameters such as the Amazon EMR cluster’s job flow ID, the log analysis Amazon S3 bucket name, and the lookback hours. The lookback window defaults to 4 hours.

The collection wizard performs these actions:

Collects HBase logs from the local filesystem. See the prerequisites for the required access permissions.
Runs sudo -u hbase hbase hbck -details (or hbck2 for HBase 2.x)
Runs hdfs dfs -ls -R /hbase or aws s3 ls <hbase-root-dir> --recursive
Runs hbase shell <<< 'scan "hbase:meta"'
Creates properly named files matching analysis system requirements
Uploads to Amazon S3 with correct naming conventions

Here’s the data collection summary:

You can check the uploaded contents through AWS CLI.

aws s3 ls s3://<log-path> --recursive

Here’s a screenshot of the outputs.

On the Analysis Amazon EC2 instance, download the collected files from Amazon S3.

# On analytics EC2 instance
sudo su - ec2-user

# Download logs from S3
mkdir -p /tmp/hbase-log-analysis
cd /tmp/hbase-log-analysis
aws s3 sync s3://<S3-BUCKET-NAME>/emr-logs/<EMR-JOBFLOW-ID>/ .

You can get your jobflow ID from Amazon EMR console:

The generated files (hbase-hbase-master-ip-xxx-xxx-xxx-xxx.ec2.internal.log.gz, hbase-hbase-regionserver-ip-xxx-xxx-xxx-xxx.ec2.internal.log.gz, hbck_report.txt, hbase_rootdir_paths.txt, hbase_meta.txt, hbase_processes.txt, log_copy_summary.txt) should match the file names that the automated processing script expects.
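
Before you start processing, you can check that every expected file is present. The following Python sketch derives wildcard patterns from the file names listed above and reports the patterns that have no match. The function name and the sample listing are illustrative, and the real processing script might apply different or stricter checks.

```python
from fnmatch import fnmatchcase

# Patterns derived from the file names listed above; host parts are wildcards.
EXPECTED_PATTERNS = [
    "hbase-hbase-master-*.log.gz",
    "hbase-hbase-regionserver-*.log.gz",
    "hbck_report.txt",
    "hbase_rootdir_paths.txt",
    "hbase_meta.txt",
    "hbase_processes.txt",
    "log_copy_summary.txt",
]


def missing_patterns(file_names, patterns=EXPECTED_PATTERNS):
    """Return the expected patterns that no downloaded file matches."""
    return [
        pattern for pattern in patterns
        if not any(fnmatchcase(name, pattern) for name in file_names)
    ]


# Sample listing in which hbase_meta.txt was not collected.
sample_files = [
    "hbase-hbase-master-ip-10-0-0-1.ec2.internal.log.gz",
    "hbase-hbase-regionserver-ip-10-0-0-2.ec2.internal.log.gz",
    "hbck_report.txt",
    "hbase_rootdir_paths.txt",
    "hbase_processes.txt",
    "log_copy_summary.txt",
]

missing = missing_patterns(sample_files)
```

On the Analysis EC2 instance, you could pass os.listdir("/tmp/hbase-log-analysis") instead of the sample listing.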

Step 4: Process and index data

Process the collected HBase data and create vector embeddings for intelligent search capabilities.

To process and index the data, navigate to the project directory on the Analysis EC2 instance and run automated-log-processing.sh:

sudo su - ec2-user
cd ~/hbase-analysis
chmod +x ./scripts/processing/automated-log-processing.sh
./scripts/processing/automated-log-processing.sh \
  --job-flow-id j-YOUR-JOB-FLOW-ID \
  --stack-name dev-hbase-log-analysis

The processing scripts extract and parse HBase logs and generate vector embeddings from HBase log messages using sentence transformer models to enable semantic search beyond keyword matching. The system uses the all-MiniLM-L6-v2 model by default (producing 384-dimensional embeddings), but supports configurable models with different embedding dimensions, automatically adapting the OpenSearch vector index to match the chosen model’s output.

The system processes comprehensive HBase operational data including region operations, compaction activities, Write-Ahead Log events, memstore operations, and cluster management information from HMaster and RegionServer logs. Vector embeddings capture error messages, exception stack traces, performance warnings, and multi-line log entries through intelligent text preprocessing. This semantic representation enables advanced troubleshooting where users can query conceptually for “region server performance issues” or “memory pressure” and receive contextually relevant results across different log files and time periods. The vector search capabilities support error correlation by grouping similar exceptions, performance analysis by identifying related bottlenecks, and operational pattern recognition.

Each log entry is stored in Amazon OpenSearch Service with original metadata (timestamp, log level, source file, job flow ID) alongside the embedding vector, enabling both structured queries and AI-powered semantic analysis. This approach transforms raw HBase logs into a searchable knowledge base supporting anomaly detection, trend analysis, and predictive insights for proactive cluster management and troubleshooting.
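
To make the index adaptation concrete, the following Python sketch builds an OpenSearch index body whose knn_vector dimension follows the embedding model, plus a k-NN query body for a query embedding. The knn_vector type and the knn query clause are standard OpenSearch k-NN features. The field names (embedding, message, log_level, and so on) are hypothetical and might differ from the names that the solution’s scripts use.

```python
def knn_index_body(dimension, vector_field="embedding"):
    """Index settings and mappings for log entries with an embedding vector."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                # The dimension must equal the embedding model's output size.
                vector_field: {"type": "knn_vector", "dimension": dimension},
                "message": {"type": "text"},
                "timestamp": {"type": "date"},
                "log_level": {"type": "keyword"},
                "source_file": {"type": "keyword"},
                "job_flow_id": {"type": "keyword"},
            }
        },
    }


def knn_query_body(query_vector, k=5, vector_field="embedding"):
    """k-NN search body that returns the k entries nearest to query_vector."""
    return {
        "size": k,
        "query": {"knn": {vector_field: {"vector": list(query_vector), "k": k}}},
    }


# all-MiniLM-L6-v2 produces 384-dimensional embeddings.
index_body = knn_index_body(384)

# Placeholder vector; a real query would embed text such as "memory pressure".
query_body = knn_query_body([0.0] * 384, k=3)
```

Switching to a model with a different output size then only changes the dimension argument, which is the adaptation that the processing script performs automatically.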

All scripts use AWS IAM authentication automatically. Here’s a screenshot of the data processing outputs.

Step 5: Enable AI-powered analysis

Configure the AI analysis interface to enable natural language queries against your HBase operational data.

To set up AI-powered analysis

Launch Kiro CLI (already configured by automated setup):

kiro-cli

Check the MCP servers and knowledge bases:

/mcp list

/knowledge show

If you don’t see the two HBase knowledge bases, you can add them manually with the following commands:

# Note: Large repositories (~500MB) may take a while to index. Check progress with: /knowledge show
/knowledge add --name "HBase operational tools" --path /opt/hbase-repositories/hbase-operator-tools
/knowledge add --name "Apache HBase source code" --path /opt/hbase-repositories/hbase

Use natural language queries to analyze your HBase data. The AI analysis uses both the OpenSearch MCP Server for querying indexed data and the Filesystem knowledge bases for accessing HBase source code. You can add your custom runbooks for Kiro’s reference as well.

For HBase inconsistency analysis:

# HBase Inconsistency Detection and Remediation Guidelines
## Search Strategy
- Use fuzzy search for case variations/typos, term query for exact region IDs, match_phrase for paths, query_string for logs
- Always use .keyword subfields for exact text matching
- Cross-reference filesystem (wildcard: {"wildcard": {"path": "*<region_id>*"}}) with hbase:meta (match: {"match": {"row_key": "<region_id>"}})
- The total region count in hbase meta must match the total matched document count of wildcard path like "*/.regioninfo" in hbase rootdir path.  
- All terms of region_name.keyword for a region encoded name must match a wildcard path like "*/.regioninfo"
- All terms of table_name.keyword for a table must match a wildcard path like "*/.tabledesc*"
- 1595e783b53d99cd5eef43b6debb2682 is the master store region that will locate in <hbase-root-dir>/MasterData/data/master/store/1595e783b53d99cd5eef43b6debb2682/
- May cross check with the raw logs in /tmp/hbase-log-analysis/
## Issue Types
Orphan regions, missing .regioninfo, missing/extra regions in hbase:meta, rowkey holes, stuck RIT, master initialization failures
## Analysis Steps
### 1. Cross-Reference Meta vs Filesystem
- Filesystem regions NOT in hbase:meta → ORPHAN REGION
- Meta regions NOT in filesystem → MISSING REGION
### 2. Validate Region Chain Continuity
- Sort regions by STARTKEY, verify region[i].ENDKEY == region[i+1].STARTKEY
- First STARTKEY must be '', last ENDKEY must be ''
- Gaps → ROWKEY HOLE
### 3. Check Region States
- state != 'OPEN' → Check RIT
- Missing server assignment → UNASSIGNED
- Multiple servers → SPLIT BRAIN
- "deployed_servers" field must have only one region server address like "ip-xxx-xxx-xxx-xxx.ec2.internal,16020,1770781485397" . The value should not be null or have multiple values. 
### 4. Validate .regioninfo Files
- Missing .regioninfo in region directory → CORRUPT REGION
### 5. Cross-Check HBCK Report
- Compare orphan counts, RIT regions, filesystem vs meta region counts
### 6. Analyze Logs
- Search: "updating hbase:meta row=<region>", "STUCK", "RIT", "Failed" + "<region>", "Split"/"Merge" + "<region>"
## Remediation
- Reference knowledge bases: "Apache HBase source code", "HBase operational tools"
- Use hbck2: /usr/lib/hbase-operator-tools/hbase-hbck2.jar
- Prefix commands with sudo -u hbase
- Use aws s3 for S3-based rootdir
- Wait 300s after creating holes before hbck fixMeta (catalog janitor cycle)
- Use unassign instead of deprecated close_region
- If the region does not have .regioninfo in  <hbase-root-dir>/data/<namespace>/<table-name>/<region-encoded-name>/ but hbase:meta has that region's information and that region has been deployed on a healthy region server, you can use hbase shell to unassign and assign the region to re-generate .regioninfo
- Always add "sudo -u hbase hbase" before "hbase shell" and "hbase hbck" commands
## Job flow
Target: <your-job-flow-id>
Inconsistency to detect: All kinds of inconsistencies

When prompted, enter “y” (yes) or “t” (trust) to allow Kiro to search through MCP and the knowledge bases.

You might see output similar to the following: Kiro checked for any HBase issues.

Kiro summarized the examination results.

Kiro provided mitigation commands after Kiro summarized the issue.
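
To make the first two analysis steps in the prompt concrete, the following is a minimal sketch of the meta-versus-filesystem cross-reference and the region chain continuity check. The function names and the in-memory inputs are hypothetical; in the walkthrough, Kiro derives this data from hbase:meta and the HBase root directory listing rather than from Python lists.

```python
def cross_reference(meta_regions, fs_regions):
    """Compare encoded region names from hbase:meta with those on the filesystem."""
    meta, fs = set(meta_regions), set(fs_regions)
    orphans = sorted(fs - meta)   # on the filesystem but not in hbase:meta
    missing = sorted(meta - fs)   # in hbase:meta but not on the filesystem
    return orphans, missing


def find_rowkey_holes(regions):
    """Check that each region's end key equals the next region's start key.

    regions is a list of (start_key, end_key) tuples for one table. An empty
    string marks the open start of the first region and the open end of the
    last one. Returns a list of (end_key, next_start_key) mismatches.
    """
    ordered = sorted(regions, key=lambda region: region[0])
    holes = []
    if ordered and ordered[0][0] != "":
        holes.append(("", ordered[0][0]))
    for (_, end_key), (next_start, _) in zip(ordered, ordered[1:]):
        if end_key != next_start:
            holes.append((end_key, next_start))
    if ordered and ordered[-1][1] != "":
        holes.append((ordered[-1][1], ""))
    return holes


# Hypothetical sample data: the key range from "d" to "f" is not covered.
sample_holes = find_rowkey_holes([("", "b"), ("b", "d"), ("f", "")])
```

A mismatch reported by this sketch can be a gap or an overlap; it only flags that the chain is broken and leaves the diagnosis to the later steps.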

Cleaning up

To avoid incurring future charges, delete the resources created during this walkthrough.

To clean up the resources

Delete the AWS CloudFormation stack from AWS Management Console:

Clean up Amazon EMR cluster resources (if created only for this walkthrough):

Check the AWS Management Console to verify that all resources are deleted, and review your AWS bill to confirm that there are no unexpected charges.

Important considerations:

Amazon OpenSearch Service domains take several minutes to fully delete
Amazon S3 buckets with versioning retain object versions
Use smaller instance types for development to optimize costs
Monitor usage with AWS Cost Explorer

Conclusion

In this post, we showed you how to build an AI-powered HBase troubleshooting solution that transforms manual log analysis into an automated workflow. By combining Amazon OpenSearch Service vector search with Amazon Bedrock-powered analysis through the Kiro CLI, operations teams can resolve complex HBase inconsistencies faster and gain deeper operational insights. The solution demonstrates how AI augments human expertise to improve operational efficiency, reducing HBase inconsistency resolution from hours to minutes and root cause identification from days to hours. Ready to transform your HBase operations? Get started with the GitHub repository and explore the Amazon OpenSearch Service documentation for additional guidance on vector search capabilities.

Acknowledgments

The author would like to thank Xi Yang, Anirudh Chawla, and Sasidhar Puthambakkam for their contributions to developing the technical solution. Xi Yang is a Senior Hadoop System Engineer and Amazon EMR subject matter expert at AWS. Anirudh Chawla is an AWS Analytics Specialist Solution Architect who helps organizations empower businesses to harness their data effectively through AWS’s analytics platform. Sasidhar Puthambakkam is a Senior Hadoop Systems Engineer and Amazon EMR Subject Matter Expert who provides architectural guidance for complex BigData workloads.

How to use streamlined permissions for Amazon S3 Tables and Iceberg materialized views

Srividya Parthasarathy — Mon, 11 May 2026 18:59:36 +0000

Apache Iceberg has emerged as the open table format for data lakes. It handles petabyte-scale datasets, lets teams evolve schemas and partitions in place, and supports time travel and incremental processing for data lake management at scale. Amazon S3 Tables provide a fully managed Apache Iceberg table experience in Amazon S3, optimized for analytics workloads, and integrate with the AWS Glue Data Catalog so AWS analytics services such as Amazon Redshift, Amazon EMR, Amazon Athena, Amazon SageMaker, and AWS Glue query your data. Together, they form the foundation of a modern data lake architecture on AWS.

S3 Tables integrate with the AWS Glue Data Catalog using AWS Identity and Access Management (IAM)-based authorization. If you manage analytics workloads across these services, you can now define permissions across storage, catalog, and compute in a single IAM policy. This gives teams already using IAM a straightforward path to govern access to S3 Tables resources without changing their existing permission model. For fine-grained access controls, you can opt in to AWS Lake Formation at any time through the AWS Management Console, AWS Command Line Interface (AWS CLI), API, or AWS CloudFormation.

Iceberg materialized views created in the Glue Data Catalog extend this foundation by letting you store pre-computed query results as Iceberg data on Amazon S3. When a query repeats aggregations or joins across large datasets, the engine reads directly from the materialized view’s S3 location rather than reprocessing the base tables. A materialized view can reside in S3 Tables or in an S3 general purpose bucket, independent of where its base tables live, which lets you place pre-computed results wherever fits your access patterns and cost model best.

In this post, we walk through how to set up and manage S3 Tables in the AWS Glue Data Catalog, create and query Iceberg materialized views, and configure access controls that work across your analytics stack with IAM-based authorization.

Solution overview

The above architecture illustrates how S3 Tables integrate with AWS Glue Data Catalog using IAM-based authorization, so you can define the necessary permissions across storage, catalog, and query engines in a single IAM policy. This permission model accelerates onboarding for new teams and workloads.

Key architecture components include:

Storage layer: Data is stored as Iceberg tables in Amazon S3 Tables.
Catalog layer: AWS Glue Data Catalog serves as the single metadata repository.
Compute layer: Amazon Athena, AWS Glue, Amazon Redshift, and Amazon EMR connect to a single Data Catalog to access Iceberg tables.
Security: AWS IAM authorizes access to resources in the storage, catalog, and compute layers.

Prerequisites:

To follow along with this post, you must have an AWS account, an IAM role or user with appropriate permissions, and familiarity with the following services:

IAM
AWS Glue Data Catalog
Amazon S3
Amazon Athena
Amazon Redshift
Amazon EMR

For the minimum permissions that the role or user requires for metadata and data access, refer to the required IAM permissions documentation.

Solution walkthrough

In this walkthrough, you integrate S3 Tables with the AWS Glue Data Catalog, create Iceberg materialized views, and query data using multiple analytics engines. You also learn how to use materialized views for complex aggregations that are queried frequently while the underlying data changes. The walkthrough takes about 45–60 minutes to complete.

Set up S3 Tables and integrate with Glue Data Catalog

Navigate to the Amazon S3 console:

On the left menu, select Table buckets.
Choose the Create table bucket button.

On the next screen, enter salesbucket as the name of the bucket. Make sure that the Enable Integration option is selected. This step integrates S3 Tables with the AWS Glue Data Catalog.

Keep the other options as default and choose Create table bucket.
After it is created, you will be redirected back to the list of table buckets. Choose the table bucket salesbucket.
Choose Create table with Athena.
Create a namespace, which in S3 Tables is equivalent to a database in the AWS Glue Data Catalog. Enter sales as the namespace (database) name and choose Create namespace.

Choose Create table with Athena. The Amazon Athena console opens in a new tab.
In the Amazon Athena console, you see an example query that creates a table, along with example statements that insert rows into that table. To use this query block, uncomment the code, then highlight and run each statement individually. When you finish, the table contains data.

Query S3 Tables and create materialized view using Amazon EMR:

To run the queries on Amazon EMR, complete the following steps to configure the cluster:

Create an IAM role for the Amazon EMR instance profile by following the Amazon EMR Management Guide. Attach the following policy to the role, and then add the trust relationship shown after it. Both are required for working with materialized views.

In the policy, replace <ACCOUNT ID> with your AWS account ID, <Instance_profile_role> with the name of the Amazon EMR instance profile role, and <REGION> with your AWS Region.

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Sid":"GlueDataCatalogPermissions",
         "Effect":"Allow",
         "Action":[
            "glue:GetCatalog",
            "glue:GetDatabase",
            "glue:CreateTable",
            "glue:GetTable",
            "glue:GetTables",
            "glue:UpdateTable",
            "glue:DeleteTable"
         ],
         "Resource":[
            "arn:aws:glue:<REGION>:<ACCOUNT ID>:catalog",
            "arn:aws:glue:<REGION>:<ACCOUNT ID>:catalog/s3tablescatalog",
            "arn:aws:glue:<REGION>:<ACCOUNT ID>:catalog/s3tablescatalog/*",
            "arn:aws:glue:<REGION>:<ACCOUNT ID>:database/salesdb",
            "arn:aws:glue:<REGION>:<ACCOUNT ID>:database/salesdb/*",
            "arn:aws:glue:<REGION>:<ACCOUNT ID>:database/s3tablescatalog",
            "arn:aws:glue:<REGION>:<ACCOUNT ID>:database/s3tablescatalog/*",
            "arn:aws:glue:<REGION>:<ACCOUNT ID>:table/s3tablescatalog/*",
            "arn:aws:glue:<REGION>:<ACCOUNT ID>:table/*/*"
         ]
      },
      {
         "Sid":"S3TablesDataAccessPermissions",
         "Effect":"Allow",
         "Action":[
            "s3tables:GetTableBucket",
            "s3tables:GetNamespace",
            "s3tables:GetTable",
            "s3tables:GetTableMetadataLocation",
            "s3tables:GetTableData",
            "s3tables:ListTableBuckets",
            "s3tables:CreateTable",
            "s3tables:PutTableData",
            "s3tables:UpdateTableMetadataLocation",
            "s3tables:ListNamespaces",
            "s3tables:ListTables",
            "s3tables:DeleteTable"
         ],
         "Resource":[
            "arn:aws:s3tables:<REGION>:<ACCOUNT ID>:bucket/*"
         ]
      },
      {
         "Effect":"Allow",
         "Action":"iam:PassRole",
         "Resource":"arn:aws:iam::<ACCOUNT ID>:role/service-role/<Instance_profile_role>"
      }
   ]
}

Add the following statement to the role’s existing trust policy:

{
   "Sid": "",
   "Effect": "Allow",
   "Principal": {
      "Service": "glue.amazonaws.com"
   },
   "Action": "sts:AssumeRole"
}

Launch an Amazon EMR cluster that runs release 7.12.0 or later, with Iceberg enabled and with the instance profile role created in the previous step. For more information, refer to Use an Iceberg cluster with Spark.
Connect to the primary node of your Amazon EMR cluster by using SSH, and run the following command to start a Spark SQL session with the required configurations.

In the command, replace <bucket_name> with your bucket name, <region> with your AWS Region, and <accountid> with your AWS account ID.

spark-sql \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.glue_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.glue_catalog.type=glue \
  --conf spark.sql.catalog.glue_catalog.warehouse=s3://<bucket_name> \
  --conf spark.sql.catalog.glue_catalog.glue.region=<region> \
  --conf spark.sql.catalog.glue_catalog.glue.id=<accountid>:s3tablescatalog/salesbucket \
  --conf spark.sql.catalog.glue_catalog.glue.account-id=<accountid> \
  --conf spark.sql.catalog.glue_catalog.client.region=<region> \
  --conf spark.sql.optimizer.answerQueriesWithMVs.enabled=true \
  --conf spark.sql.defaultCatalog=glue_catalog

Run the following queries to query the daily_sales table.

spark-sql ()> use sales;
spark-sql (sales)> select * from daily_sales;
2024-01-15 Laptop 900.0
2024-01-15 Monitor 250.0
2024-01-16 Laptop 1350.0
2024-02-01 Monitor 300.0
2024-02-01 Keyboard 60.0
2024-02-02 Mouse 25.0
2024-02-02 Laptop 1050.0
2024-02-03 Laptop 1200.0
2024-02-03 Monitor 375.0

Create Materialized view.

CREATE MATERIALIZED VIEW sales_mv as 
SELECT 
    product_category,
    COUNT(*) as units_sold,
    SUM(sales_amount) as total_revenue, 
    AVG(sales_amount) as average_price 
FROM 
    glue_catalog.sales.daily_sales 
GROUP BY 
    product_category;

A newly created materialized view is populated with the initial query results but does not update automatically as base table data changes. To keep it current, specify a SCHEDULE REFRESH EVERY clause when creating the view. The clause accepts a time interval and unit, so you can define how often the materialized view is recomputed from the base tables.

Add refresh interval.

CREATE MATERIALIZED VIEW sales_mv 
SCHEDULE REFRESH EVERY 2 HOURS as 
SELECT 
    product_category,
    COUNT(*) as units_sold,
    SUM(sales_amount) as total_revenue, 
    AVG(sales_amount) as average_price 
FROM 
    glue_catalog.sales.daily_sales 
GROUP BY 
    product_category;

Alternatively, you can refresh them manually.

For manual full refresh, you can use the following command:

REFRESH MATERIALIZED VIEW sales_mv FULL;

For manual incremental refresh, you can use the following command:

REFRESH MATERIALIZED VIEW sales_mv;

For more details, refer to Refreshing materialized views.

Query the MV.

spark-sql (sales)> select * from sales_mv;
Keyboard 1 60.0 60.0
Laptop 4 4500.0 1125.0
Mouse 1 25.0 25.0
Monitor 3 925.0 308.3333333333333
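
As a cross-check on what the materialized view stores, the following minimal Python sketch recomputes the same aggregation from the nine daily_sales rows printed earlier. It runs locally and does not touch Spark or the Data Catalog. The tuple layout (sale date, product_category, sales_amount) is an assumption inferred from the query output and the view definition.

```python
from collections import defaultdict

# Rows copied from the daily_sales query output in this walkthrough.
# Assumed layout: (sale date, product_category, sales_amount).
daily_sales = [
    ("2024-01-15", "Laptop", 900.0),
    ("2024-01-15", "Monitor", 250.0),
    ("2024-01-16", "Laptop", 1350.0),
    ("2024-02-01", "Monitor", 300.0),
    ("2024-02-01", "Keyboard", 60.0),
    ("2024-02-02", "Mouse", 25.0),
    ("2024-02-02", "Laptop", 1050.0),
    ("2024-02-03", "Laptop", 1200.0),
    ("2024-02-03", "Monitor", 375.0),
]


def aggregate(rows):
    """Mirror the view: COUNT(*), SUM(sales_amount), AVG(sales_amount) per category."""
    totals = defaultdict(lambda: [0, 0.0])  # category -> [units_sold, total_revenue]
    for _sale_date, category, amount in rows:
        totals[category][0] += 1
        totals[category][1] += amount
    return {
        category: (units, revenue, revenue / units)
        for category, (units, revenue) in totals.items()
    }


sales_mv = aggregate(daily_sales)
for category, (units, revenue, average) in sorted(sales_mv.items()):
    print(category, units, revenue, average)
```

The four result rows match the materialized view output: a query that reads sales_mv scans four precomputed rows instead of re-aggregating the nine base rows.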

After the Iceberg materialized views are created, you can access them using IAM principals that have the required IAM permissions on the Glue Data Catalog resources and the underlying storage.

Iceberg materialized views are flexible in how they combine base tables and access control modes. Base tables can reside in S3 general-purpose buckets (with IAM or Lake Formation access control), in S3 Tables (through the s3tablescatalog catalog), or a combination of these—all within a single materialized view definition. The materialized view itself can use either IAM or AWS Lake Formation access control, independently of its base tables.

For more details, refer to How materialized views work with AWS Glue.

Query using Athena:

Additionally, you can query the same materialized view from Athena SQL. The following image shows the same query run on Athena and the resulting output.

Query using Amazon Redshift:

To query the S3 Tables in AWS Glue Data Catalog using Amazon Redshift, you must create a database in the default catalog in Glue Data Catalog that points to the S3 Tables catalog.

On the AWS Glue console, choose Databases, and then choose Add Database.

Choose the Glue Database resource link option, enter salesdb as the name of the database, choose salesbucket as the target catalog, and choose sales as the target database. Then choose Create database.

After the database is created, the salesdb resource link appears under Databases in the AWS Glue Data Catalog.

Create an IAM role with the following policy for Amazon Redshift schema creation. Replace <REGION> and <ACCOUNTID> with your AWS Region and account ID.

{
   "Version":"2012-10-17",
   "Statement":[
      {
         "Sid":"GlueDataCatalogPermissions",
         "Effect":"Allow",
         "Action":[
            "glue:GetCatalog",
            "glue:GetDatabase",
            "glue:CreateTable",
            "glue:GetTable",
            "glue:GetTables",
            "glue:UpdateTable",
            "glue:DeleteTable"
         ],
         "Resource":[
            "arn:aws:glue:<REGION>:<ACCOUNTID>:catalog",
            "arn:aws:glue:<REGION>:<ACCOUNTID>:catalog/s3tablescatalog",
            "arn:aws:glue:<REGION>:<ACCOUNTID>:catalog/s3tablescatalog/*",
            "arn:aws:glue:<REGION>:<ACCOUNTID>:database/salesdb",
            "arn:aws:glue:<REGION>:<ACCOUNTID>:database/salesdb/*",
            "arn:aws:glue:<REGION>:<ACCOUNTID>:database/s3tablescatalog",
            "arn:aws:glue:<REGION>:<ACCOUNTID>:database/s3tablescatalog/*",
            "arn:aws:glue:<REGION>:<ACCOUNTID>:table/s3tablescatalog/*",
            "arn:aws:glue:<REGION>:<ACCOUNTID>:table/*/*"
         ]
      },
      {
         "Sid":"S3TablesDataAccessPermissions",
         "Effect":"Allow",
         "Action":[
            "s3tables:GetTableBucket",
            "s3tables:GetNamespace",
            "s3tables:GetTable",
            "s3tables:GetTableMetadataLocation",
            "s3tables:GetTableData",
            "s3tables:ListTableBuckets",
            "s3tables:CreateTable",
            "s3tables:PutTableData",
            "s3tables:UpdateTableMetadataLocation",
            "s3tables:ListNamespaces",
            "s3tables:ListTables",
            "s3tables:DeleteTable"
         ],
         "Resource":[
            "arn:aws:s3tables:<REGION>:<ACCOUNTID>:bucket/*"
         ]
      }
   ]
}

Create an Amazon Redshift provisioned cluster or an Amazon Redshift Serverless workgroup, and attach the IAM role created in the previous step.

To access the AWS Glue Data Catalog and the resource link, log in to Amazon Redshift as a local user. This walkthrough uses the admin user and Amazon Redshift Query Editor v2.

To create the external schema, run the following command. Replace <ACCOUNT_ID> with your AWS account ID, <IAM_ROLE> with the name of the IAM role created for schema access, and <REGION> with your AWS Region.

CREATE EXTERNAL SCHEMA salesdb
FROM DATA CATALOG DATABASE 'salesdb'
IAM_ROLE 'arn:aws:iam::<ACCOUNT_ID>:role/<IAM_ROLE>'
REGION '<REGION>'
CATALOG_ID '<ACCOUNT_ID>';

After you have created the external schema, it will show up on the left side, under the dev database. The table that we created, daily_sales, is available and we can query directly from Amazon Redshift using a local user.

Cleanup:

After completing the walkthrough, remove the resources that you created to avoid ongoing charges. These cleanup steps permanently delete the data, including the daily_sales table and the sales_mv materialized view, so back up any data that you need to retain before proceeding.

Remove the Glue Data Catalog resources
Delete the table bucket
Delete the Amazon Redshift cluster or Amazon Redshift Serverless workgroup
Terminate the Amazon EMR cluster
Delete the IAM roles and policies that you created

Conclusion

Amazon S3 Tables now integrate with the AWS Glue Data Catalog through IAM-based authorization. By consolidating permissions for storage, catalog, and query engines into one IAM policy, you can streamline authorization across AWS analytics services such as Amazon Athena, Amazon EMR, and AWS Glue, and build your data lake faster while maintaining enterprise-grade security. For organizations with more granular data access requirements, AWS Lake Formation remains available to layer fine-grained access controls on top of this foundation. You can opt in through the AWS Management Console, AWS CLI, API, or AWS CloudFormation. This integration lets AWS analytics users rely on IAM and scale their analytics capabilities with reduced operational complexity.

To learn more about S3 Tables and their integration with the AWS Glue Data Catalog, see Amazon S3 Tables integration with AWS analytics services overview and Integrating with Amazon S3 Tables.

Improve DynamoDB analytics with AWS Glue zero-ETL schema and partition controls

Raju Ansari — Mon, 11 May 2026 18:51:22 +0000

You store transactional data in Amazon DynamoDB and get single-digit millisecond performance. However, when you want to run analytics, machine learning (ML), or reporting on that same data, you face a gap: your flexible, semi-structured DynamoDB schemas don’t align with the flat, columnar formats that analytics engines require. Bridging this gap typically means building and maintaining custom ETL pipelines, which adds development cost and operational overhead.

AWS Glue Zero-ETL integration removes that pipeline work. It replicates your DynamoDB tables to Apache Iceberg tables in Amazon Simple Storage Service (Amazon S3), where you can query the data directly with Amazon Athena. During setup, you configure two capabilities that shape how the replicated data looks and performs: schema unnesting flattens nested attributes into individual columns, and data partitioning organizes data so that your queries scan only what they need.

In this post, you learn how to replicate Amazon DynamoDB data to Apache Iceberg tables in Amazon S3 through a zero-ETL integration. We walk through the challenges that the DynamoDB nested, schema-flexible data model introduces for analytics workloads, and show you how to configure schema unnesting and data partitioning for a sample product catalog table. We also cover how to query the replicated data in Amazon Athena using standard SQL.

Semi-structured data meets analytics

Your product catalog in DynamoDB contains items with nested attributes like product details, pricing tiers, and inventory information. A typical item looks like this:

{
  "product_id": "P-1001",
  "name": "Wireless Headphones",
  "productdetails": {
    "brand": "AudioTech",
    "category": "Electronics",
    "weight_kg": 0.25,
    "specification": {
       "color": "Black",
       "storage": "128GB"
    }
  },
  "pricing": {
    "list_price": 79.99,
    "discount_pct": 10
  },
  "created_at": 1701388800000
}

This structure supports fast transactional reads and writes. However, when you replicate this data for analytics, you face two decisions:

You must decide whether to flatten nested maps like productdetails into individual columns or preserve them as-is.
You must choose how to organize the data on disk so that queries filtering by brand or date range scan only relevant partitions.

With AWS Glue Zero-ETL, you address both decisions through configurable schema unnesting and data partitioning.

Solution overview

You replicate data from your DynamoDB table through AWS Glue Zero-ETL into Apache Iceberg tables stored in Amazon S3, then query the results with Amazon Athena. The following diagram illustrates the end-to-end architecture:

AWS Glue zero-ETL ingests data from Amazon DynamoDB, writes it in Apache Iceberg format to your Amazon S3 data lake, and makes it available for SQL queries in Amazon Athena—with no pipelines to build or maintain. With this integration, you:

Save development time by skipping custom code and ETL job management
Keep DynamoDB performance intact because replication doesn’t consume your table’s provisioned read/write capacity
Get data within 15 minutes of changes in the source table
Query with standard tools because data lands in Apache Iceberg format, an open table format that AWS natively supports for high-performance analytics

During setup, you configure two output settings:

Schema unnesting in Zero-ETL: You choose how nested attributes appear in the target. Flattening nested maps into individual columns streamlines your queries and reduces complexity.
Data partitioning in Zero-ETL: You choose how data is organized into partitions. When you filter on a partition column, the query engine reads only matching data instead of scanning everything, cutting both query time and cost.

Schema unnesting

When you create a zero-ETL integration, you can choose one of three unnesting options. Schema unnesting transforms complex, nested DynamoDB structures into formats that analytics engines can query directly, removing post-processing transformations.

Each option changes how nested DynamoDB attributes appear in the target table. The right choice depends on your analytics tools and how consistent your DynamoDB schemas are.

Option 1: No unnesting

This option preserves the original nested structure. DynamoDB maps and lists remain as structured columns in the target.

Using the product example, the target table has two columns: product_id, which holds the DynamoDB partition key, and value, which holds the DynamoDB record.

Recommended for: Workloads where your analytics tools natively support querying nested data and you want to preserve the DynamoDB structure unchanged.

Option 2: Unnest one level

This option flattens top-level maps into individual columns. Lists remain nested.

With this option, productdetails and pricing each become separate columns.

Recommended for: Scenarios where your DynamoDB items have a consistent schema and you want to balance structure preservation with query simplicity.

Option 3: Unnest all levels (default)

This option recursively flattens nested structures using dot notation and produces the flattest schema.

For the product table, this creates columns such as productdetails.brand, productdetails.category, productdetails.specification.color, productdetails.specification.storage, pricing.list_price, and pricing.discount_pct. Each column is directly queryable without nested access patterns.

Recommended for: Analytics tools that prefer flat schemas when your DynamoDB items have a reasonably consistent structure. Note that deeply nested or highly variable schemas can produce very wide tables.
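
To illustrate the dot-notation flattening that Option 3 describes, here is a minimal Python sketch applied to the sample product item. It is a local illustration of the naming scheme only, not the AWS Glue implementation; the function name unnest_all_levels is hypothetical, and leaving non-map values (including lists) untouched is an assumption.

```python
def unnest_all_levels(item, prefix=""):
    """Recursively flatten nested maps into columns named with dot notation."""
    flat = {}
    for key, value in item.items():
        column = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Descend into the nested map and carry the parent name as a prefix.
            flat.update(unnest_all_levels(value, column))
        else:
            # Scalars (and, by assumption, lists) become a column as-is.
            flat[column] = value
    return flat


product = {
    "product_id": "P-1001",
    "name": "Wireless Headphones",
    "productdetails": {
        "brand": "AudioTech",
        "category": "Electronics",
        "weight_kg": 0.25,
        "specification": {"color": "Black", "storage": "128GB"},
    },
    "pricing": {"list_price": 79.99, "discount_pct": 10},
    "created_at": 1701388800000,
}

columns = unnest_all_levels(product)
for name in columns:
    print(name)
```

The sample item yields ten columns from five top-level attributes, which shows how quickly deeply nested or highly variable items can widen the target table.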

Data partitioning

You can speed up your queries and reduce costs by partitioning your replicated data. Partitioning divides data into logical segments on disk.

When you include a filter on a partition column in your query, the query engine skips irrelevant segments entirely. This behavior is called partition pruning: instead of scanning the entire dataset, the engine reads only the data that matches your filter conditions. For large tables, partition pruning can reduce both query runtime and cost significantly.

Default partitioning

If you don’t specify partition columns, AWS Glue Zero-ETL partitions data using the DynamoDB primary key with bucketing. This approach supports general-purpose queries without requiring manual configuration. For specific query patterns or performance requirements, you can define custom partitioning strategies described in the subsections that follow.

Identity partitioning

Identity partitioning uses raw column values to create partitions. You apply this strategy to low-to-medium cardinality columns such as brand, category, or AWS Region. To partition the product table by productdetails.brand and create a separate partition for each brand, use this configuration:

{
  "partitionSpec": [
    {
      "fieldName": "productdetails.brand",
      "functionSpec": "identity"
    }
  ]
}

With this setup, AWS Glue creates one partition directory per unique brand value. When you query for a specific brand, Athena reads only that partition.

Important: Avoid identity partitioning on high-cardinality columns such as primary keys or timestamps. This creates many small partitions, which degrades both ingestion and query performance.
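
The effect of identity partitioning on a filtered query can be sketched in a few lines of Python. This is a toy in-memory model, not how Athena or Iceberg store data: the function names are hypothetical, and the rows other than P-1001 (including the SoundMax brand) are made-up sample values.

```python
from collections import defaultdict

# P-1001 comes from the sample item; the other rows are made-up values.
rows = [
    {"product_id": "P-1001", "productdetails.brand": "AudioTech"},
    {"product_id": "P-1002", "productdetails.brand": "SoundMax"},
    {"product_id": "P-1003", "productdetails.brand": "AudioTech"},
]


def partition_by_identity(rows, field):
    """Group rows by the raw value of the partition column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[field]].append(row)
    return dict(partitions)


def query_with_pruning(partitions, value):
    """Read only the partition that matches the filter value.

    Returns the matching rows and the number of rows that were scanned.
    """
    scanned = partitions.get(value, [])
    return list(scanned), len(scanned)


partitions = partition_by_identity(rows, "productdetails.brand")
matched, rows_scanned = query_with_pruning(partitions, "AudioTech")
```

In this model, the filter on AudioTech scans two rows instead of three. The same model also shows the cost of a high-cardinality column: partitioning by product_id would create one partition per row.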

Time-based partitioning

Time-based partitioning organizes data by timestamp at a chosen granularity: year, month, day, or hour. You apply this strategy to time-series data and time-range queries. To partition the product table by month on the created_at column, which stores epoch milliseconds, use this configuration:

{
  "partitionSpec": [
    {
      "fieldName": "created_at",
      "functionSpec": "month",
      "conversionSpec": "epoch_milli"
    }
  ]
}

The conversionSpec parameter tells AWS Glue how to interpret the source timestamp. Supported values: epoch_sec (Unix seconds), epoch_milli (Unix milliseconds), and iso (ISO 8601 format).

Note: The original column values remain unchanged. AWS Glue transforms only the partition column values to timestamp type in the target table.
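
To show how the conversionSpec and functionSpec settings combine, the following minimal Python sketch converts a source value to a UTC timestamp and truncates it to the chosen granularity. The function names and the label formats (for example, 2023-12 for a month) are illustrative assumptions; the sketch does not reproduce how AWS Glue names partitions internally.

```python
from datetime import datetime, timezone

# Illustrative label formats per granularity (an assumption, not the Glue layout).
LABEL_FORMATS = {
    "year": "%Y",
    "month": "%Y-%m",
    "day": "%Y-%m-%d",
    "hour": "%Y-%m-%d-%H",
}


def to_utc(value, conversion_spec):
    """Interpret the source value according to conversionSpec."""
    if conversion_spec == "epoch_sec":
        return datetime.fromtimestamp(value, tz=timezone.utc)
    if conversion_spec == "epoch_milli":
        return datetime.fromtimestamp(value / 1000, tz=timezone.utc)
    if conversion_spec == "iso":
        parsed = datetime.fromisoformat(value)
        if parsed.tzinfo is None:
            # Assumption for this sketch: values without an offset are UTC.
            parsed = parsed.replace(tzinfo=timezone.utc)
        return parsed.astimezone(timezone.utc)
    raise ValueError(f"unsupported conversionSpec: {conversion_spec!r}")


def partition_label(value, function_spec, conversion_spec):
    """Truncate the converted timestamp to the chosen granularity."""
    return to_utc(value, conversion_spec).strftime(LABEL_FORMATS[function_spec])


# created_at from the sample product item, stored as epoch milliseconds.
print(partition_label(1701388800000, "month", "epoch_milli"))  # 2023-12
```

The same instant expressed as epoch seconds (1701388800) or as an ISO 8601 string with an offset (2023-11-30T19:00:00-05:00) lands in the same month once it is converted to UTC, which is why the conversion setting has to match the stored format.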

Multi-level partitioning

You can combine strategies for a hierarchical scheme. To partition first by month and then by brand, use this configuration:

{
  "partitionSpec": [
    {
      "fieldName": "created_at",
      "functionSpec": "month",
      "conversionSpec": "epoch_milli"
    },
    {
      "fieldName": "productdetails.brand",
      "functionSpec": "identity"
    }
  ]
}

This scheme supports efficient queries that filter by date range, brand, or both. Place higher-selectivity columns first in the hierarchy and align the scheme with your most common query patterns.
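
The hierarchy that a multi-level specification produces can be sketched by applying each entry of the partitionSpec in order. The helper below is hypothetical and handles only the two functions used in this example (month with epoch_milli, and identity); the month label format is an illustrative assumption.

```python
from datetime import datetime, timezone

partition_spec = [
    {"fieldName": "created_at", "functionSpec": "month", "conversionSpec": "epoch_milli"},
    {"fieldName": "productdetails.brand", "functionSpec": "identity"},
]


def partition_key(row, spec):
    """Build one partition value per spec entry, in hierarchy order."""
    parts = []
    for entry in spec:
        value = row[entry["fieldName"]]
        function = entry["functionSpec"]
        if function == "identity":
            parts.append(str(value))
        elif function == "month" and entry.get("conversionSpec") == "epoch_milli":
            moment = datetime.fromtimestamp(value / 1000, tz=timezone.utc)
            parts.append(moment.strftime("%Y-%m"))
        else:
            raise NotImplementedError(f"not covered by this sketch: {entry}")
    return tuple(parts)


# Flattened sample row using values from the product item.
row = {"created_at": 1701388800000, "productdetails.brand": "AudioTech"}
key = partition_key(row, partition_spec)
print(key)  # ('2023-12', 'AudioTech')
```

Because the month comes first in the key, a date-range filter narrows the search before the brand filter is applied; reversing the entries in the specification reverses that order.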

Best practices

Keep these guidelines in mind when you configure your integration:

Avoid identity partitioning on high-cardinality columns such as primary keys, timestamps, or system-generated IDs. This leads to partition explosion and degrades performance.
Apply only one time-based function per column. For example, don’t partition col1 by year, month, day, and hour simultaneously.
Match conversionSpec to your actual data format. If your timestamps are in epoch milliseconds, use epoch_milli, not epoch_sec or iso.
Choose granularity based on data volume. High-volume tables benefit from finer granularity (day or hour). Lower-volume tables work well with coarser granularity (month or year).
Account for timezone implications with ISO timestamps. AWS Glue Zero-ETL normalizes timestamp partition values to UTC.
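
Two of these guidelines can be checked mechanically before you create an integration. The following is a minimal sketch of a local lint for a partitionSpec list. It is a hypothetical helper, not part of AWS Glue, and it checks only the one-time-function-per-column rule and the three conversionSpec values listed in this post.

```python
TIME_FUNCTIONS = {"year", "month", "day", "hour"}
CONVERSIONS = {"epoch_sec", "epoch_milli", "iso"}


def lint_partition_spec(spec):
    """Return a list of human-readable problems found in a partitionSpec list."""
    problems = []
    time_columns = set()
    for entry in spec:
        field = entry["fieldName"]
        if entry["functionSpec"] in TIME_FUNCTIONS:
            # Guideline: apply only one time-based function per column.
            if field in time_columns:
                problems.append(f"{field}: more than one time-based function")
            time_columns.add(field)
        conversion = entry.get("conversionSpec")
        if conversion is not None and conversion not in CONVERSIONS:
            problems.append(f"{field}: unsupported conversionSpec {conversion!r}")
    return problems


good_spec = [
    {"fieldName": "created_at", "functionSpec": "month", "conversionSpec": "epoch_milli"},
    {"fieldName": "productdetails.brand", "functionSpec": "identity"},
]
bad_spec = [
    {"fieldName": "col1", "functionSpec": "year", "conversionSpec": "epoch_milli"},
    {"fieldName": "col1", "functionSpec": "month", "conversionSpec": "epoch_ms"},
]

print(lint_partition_spec(good_spec))
print(lint_partition_spec(bad_spec))
```

A check like this cannot tell whether conversionSpec matches the data that is actually stored, or whether a column is high-cardinality; those guidelines still require looking at the source table.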

Prerequisites

To implement the AWS Glue Zero-ETL integration with a DynamoDB source, you will need:

An AWS account with IAM permissions that follow the principle of least privilege
An AWS Glue database (for example, ddb_zero_etl_demo_db) with an Amazon S3 bucket associated as the database location (setup instructions)
AWS Glue Data Catalog settings updated with an AWS Identity and Access Management (IAM) policy that grants fine-grained access control for zero-ETL (setup instructions)
An IAM role named zetl-role that zero-ETL uses to access data in your DynamoDB table
A DynamoDB source table (for example, product) configured for zero-ETL integration (setup instructions)

Walkthrough: Create the zero-ETL integration

Complete these steps to create a zero-ETL integration with DynamoDB as the source and Apache Iceberg tables in Amazon S3 as the target.

Step 1: Select the source type

Open the AWS Glue console.
In the navigation pane, under Data Integration and ETL, choose Zero-ETL integrations.
Choose Create zero-ETL integration.
Select Amazon DynamoDB as the source type, then choose Next.

[Figure 1: Selecting Amazon DynamoDB as the zero-ETL source type]

Step 2: Configure source and target

In Source details, select your DynamoDB table (for example, product).
In Target details:

- Select the current account as target.
- Choose the catalog and target database (for example, ddb_zero_etl_demo_db).
- Select the IAM role (for example, zetl-role).

[Figure 2: Configuring source DynamoDB table and target database]

Step 3: Configure output settings

Under Schema unnesting, select Unnest all fields.
Under Data partitioning, select Specify custom partition keys.
Enter the partition key (for example, productdetails.brand) and set the function to Identity.
Choose Next.

[Figure 3: Configuring schema unnesting and partition key settings]

Step 4: Set integration details

Optionally configure encryption and replication settings. The default refresh interval is 15 minutes.
Enter a name for the integration (for example, ddb-zero-etl-demo).
Choose Next.

[Figure 4: Configuring encryption and replication settings]

Step 5: Review and create

Review your settings and choose Create and launch integration.
The integration shows as Active within about a minute.

[Figure 5: Review and create summary]

[Figure 6: Integration active with successful status]

Query the replicated data

After the integration is active and the initial replication completes (typically 15–30 minutes), you can query the data in Amazon Athena.

Preview the replicated data

Open the Amazon Athena console.
In the query editor, select your target database (for example, ddb_zero_etl_demo_db).
Run a preview query:

SELECT * FROM "ddb_zero_etl_demo_db"."product" LIMIT 10;

Verify schema unnesting

With Unnest all fields selected, nested attributes appear as individual columns with dot notation:

SELECT "productdetails.brand", "productdetails.category", "pricing.list_price" 
FROM "ddb_zero_etl_demo_db"."product"
WHERE "productdetails.category" = 'Electronics';
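
To make the dot-notation column names concrete, the following Python sketch flattens a nested item the way the query above expects to see it. It only illustrates the naming convention and is not the integration's implementation; the sample item and the `unnest` helper are hypothetical.

```python
def unnest(item, prefix=""):
    """Flatten nested dicts into dot-notation column names."""
    flat = {}
    for key, value in item.items():
        column = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse so that {"a": {"b": 1}} becomes {"a.b": 1}.
            flat.update(unnest(value, column))
        else:
            flat[column] = value
    return flat


# Hypothetical item shaped like the walkthrough's product table.
item = {
    "product_id": "P-100",
    "name": "Wireless Headphones",
    "productdetails": {"brand": "AudioTech", "category": "Electronics"},
    "pricing": {"list_price": 199.99},
}

row = unnest(item)
for column in sorted(row):
    print(column, "=", row[column])
```

Lists and other non-map values are left untouched in this sketch; how the integration treats them is outside what this example shows.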

Verify partition pruning

Queries that filter on the partition column (productdetails.brand) automatically skip irrelevant partitions:

SELECT product_id, name, "pricing.list_price"
FROM "ddb_zero_etl_demo_db"."product"
WHERE "productdetails.brand" = 'AudioTech';

[Figure 7: Athena query to retrieve the data from Apache Iceberg lakehouse]

You can verify the partition structure by navigating to the Amazon S3 bucket associated with your database. The data organizes into directories like:

[Figure 8: Amazon S3 bucket organization for the identity partition productdetails.brand]
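
As a rough mental model of why filtering on an identity partition column skips data, the following Python sketch groups rows by the unmodified column value and reads only the matching group. The rows, the second brand name, and both helper functions are hypothetical; a real query engine prunes using table metadata rather than in-memory dictionaries.

```python
from collections import defaultdict


def partition_by_identity(rows, key):
    """Group rows by the unmodified value of the partition column."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key]].append(row)
    return dict(partitions)


def scan(partitions, wanted):
    """Read only the partition that matches the filter value."""
    rows = partitions.get(wanted, [])
    skipped = len(partitions) - (1 if wanted in partitions else 0)
    return rows, skipped


# Hypothetical rows; only the brand column matters for pruning.
rows = [
    {"product_id": "P-100", "productdetails.brand": "AudioTech"},
    {"product_id": "P-200", "productdetails.brand": "HomeChef"},
    {"product_id": "P-300", "productdetails.brand": "AudioTech"},
]

partitions = partition_by_identity(rows, "productdetails.brand")
matched, skipped = scan(partitions, "AudioTech")
print([r["product_id"] for r in matched], "partitions skipped:", skipped)
```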

Clean up

To avoid ongoing charges, delete the resources in this order:

Delete the zero-ETL integration. In the AWS Glue console, navigate to Zero-ETL integrations, select your integration, and choose Delete. Existing replicated data remains in the target, but new changes stop replicating.
Delete the replicated table. In the AWS Glue Data Catalog, navigate to Tables, select the replicated table, and delete it.
Delete the AWS Glue database. In the Data Catalog, select the database and delete it.
Delete the Amazon S3 data. Empty and delete the S3 bucket associated with the database.
Delete the DynamoDB table. If you created it for this walkthrough, delete the source table.
Delete IAM resources. Remove the IAM role and policies created for the integration.

Conclusion

You configured schema unnesting and data partitioning for a DynamoDB zero-ETL integration, replicated a product catalog table to Apache Iceberg tables in Amazon S3, and verified the results in Amazon Athena. Unnesting flattened nested attributes into directly queryable columns. Partitioning helped the query engine skip irrelevant data, reducing both query time and cost. To take your integration further, try monitoring replication lag and data freshness with Amazon CloudWatch metrics. You can also experiment with different partitioning strategies on a staging table before applying them to production workloads, testing time-based partitioning alongside identity partitioning to find the optimal scheme for your query patterns. For broader analytics coverage, query the same Iceberg tables from Amazon Redshift Spectrum or Amazon EMR alongside Athena.


How to build a cross-Region resilience for Amazon OpenSearch Service with Amazon MSK

Sriharsha Subramanya Begolli — Mon, 11 May 2026 18:46:43 +0000

Cross-Region resilience for Amazon OpenSearch Service has historically been a complex challenge. Existing approaches rely on S3-based snapshots or cross-cluster replication, which demand intricate manual failover procedures and often result in hours of downtime, data inconsistencies, and significant lag during outages or other operational disruptions. To overcome these limitations and help businesses stay focused on their core objectives, we’ve developed a solution that automatically keeps data synchronized across AWS Regions while supporting active-active operations in both Regions.

AWS offers two OpenSearch offerings, namely Amazon OpenSearch Service, a managed cluster-based service where you provision and manage OpenSearch domains (nodes, storage, scaling), and Amazon OpenSearch Serverless, a serverless option where AWS automatically manages infrastructure and scaling and you create collections for your search or analytics workloads. OpenSearch Service provides high availability (HA) within an AWS Region through its Multi-AZ deployment model and provides Regional resiliency with cross-cluster replication. Amazon Managed Streaming for Apache Kafka (Amazon MSK) Replicator is an Amazon MSK feature that you can use to reliably replicate data across Amazon MSK clusters in different or the same AWS Region.

In this post, we outline the solution that provides cross-Region resiliency without needing to reestablish relationships during a fail-back, using an active-active replication model with Amazon OpenSearch Ingestion (OSI) and Amazon Managed Streaming for Apache Kafka (Amazon MSK). This solution applies to both OpenSearch Service managed clusters and Amazon OpenSearch Serverless collections. We use Amazon OpenSearch Serverless as an example for the configurations in this post.

Solution overview

In this solution, we use Amazon MSK Replicator for bidirectional cross-Region data replication, with OSI pipelines to index data into Amazon OpenSearch Serverless collections in each AWS Region. Although the S3-based approach serves the purpose, Amazon MSK Replicator provides near real-time replication with identical topic naming, which supports active-active operations. Amazon MSK Replicator also provides automatic loop prevention and consumer group offset synchronization, which simplifies cross-Region failover. You can find the code for the entire solution in the GitHub repo.

Your architecture follows a Regional-first approach where data sources write to a local Amazon MSK cluster within their AWS Region. In this sample deployment, an AWS Lambda function serves as the producer, streaming data into the MSK cluster. OSI pipelines consume the incoming data from the local MSK cluster and persist it to an Amazon OpenSearch Serverless collection within the same AWS Region. To achieve cross-Region data synchronization, Amazon MSK Replicator performs bidirectional replication between the Amazon MSK clusters, preserving the same topic names across both environments. With this design, the Amazon OpenSearch Serverless collections in each AWS Region maintain identical datasets, which provides low-latency search capabilities and high availability for globally distributed workloads.

Prerequisites

Deploy the AWS CloudFormation template to install the prerequisites. The solution has the following prerequisite steps:

Set up Amazon Virtual Private Cloud (Amazon VPC) infrastructure in both Regions
1. Create Amazon VPCs with private subnets in two or three Availability Zones for high availability at the AWS Region level
2. Configure Network Address Translation (NAT) gateways for outbound internet access from private subnets
3. Use non-overlapping CIDR blocks
Establish Amazon OpenSearch Serverless collections in both AWS Regions
1. Create Amazon OpenSearch Serverless collections for log analytics
2. Configure encryption, network, and data access policies
3. Create Amazon VPC endpoints for private access
Configure MSK clusters in both AWS Regions
1. Enable AWS Identity and Access Management (IAM) authentication (SASL/IAM)
2. Enable Multi-VPC connectivity (required for Amazon MSK Replicator and OSI)
3. Configure MSK cluster policies to allow the kafka.amazonaws.com and osis-pipelines.amazonaws.com service principals
Configure IAM permissions for pipeline and replication access
1. Create IAM roles for the OSI pipelines with permissions to access Amazon MSK and Amazon OpenSearch Serverless
2. Create IAM roles for the Amazon MSK Replicator with permissions for cross-Region access to Amazon MSK clusters
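
The cluster policy prerequisite can be sketched as a small Python helper that builds the policy document for both service principals. The two principals come from the list above; the action list is an assumption based on what multi-VPC connectivity typically requires, and the account ID in the ARN is a placeholder, so check both against the Amazon MSK documentation and the CloudFormation template before use.

```python
import json


def msk_cluster_policy(cluster_arn):
    """Build a cluster policy that lets both service principals connect."""
    # Assumed action list for multi-VPC connectivity; confirm against the docs.
    actions = [
        "kafka:CreateVpcConnection",
        "kafka:GetBootstrapBrokers",
        "kafka:DescribeClusterV2",
    ]
    # Service principals named in the prerequisites.
    principals = ["kafka.amazonaws.com", "osis-pipelines.amazonaws.com"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": principal},
                "Action": actions,
                "Resource": cluster_arn,
            }
            for principal in principals
        ],
    }


# Placeholder ARN; substitute your own account ID and cluster ID.
arn = "arn:aws:kafka:us-east-1:111122223333:cluster/production-msk-primary/CLUSTER_ID"
policy = msk_cluster_policy(arn)
print(json.dumps(policy, indent=2))
```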

This AWS CloudFormation template deploys all of the required configurations, with us-east-1 as the primary AWS Region and us-west-2 as the secondary AWS Region.

The following snippet shows the configuration for the OSI pipeline, which writes data from Amazon MSK to Amazon OpenSearch Serverless. The OSI pipeline uses MSK as a source with IAM authentication.

version: "2"
kafka-pipeline:
  source:
    kafka:
      acknowledgments: true
      topics:
        - name: "opensearch-data"
          group_id: "osi-consumer-group-primary"
      aws:
        msk:
          arn: "arn:aws:kafka:us-east-1:<aws-account-id>:cluster/production-msk-primary/CLUSTER_ID"
        region: "us-east-1"
        sts_role_arn: "arn:aws:iam::<aws-account-id>:role/production-osi-pipeline-primary-role"
  sink:
    - opensearch:
        hosts:
          - "https://<OPENSEARCH_SERVERLESS_COLLECTION_ID>.us-east-1.aoss.amazonaws.com"
        index: "application-logs-${yyyy.MM.dd}"
        aws:
          serverless: true
          region: "us-east-1"
          sts_role_arn: "arn:aws:iam::<aws-account-id>:role/production-osi-pipeline-primary-role"
        dlq:
          s3:
            bucket: "production-opensearch-dlq-us-east-1"
            region: "us-east-1"
            sts_role_arn: "arn:aws:iam::<aws-account-id>:role/production-osi-pipeline-primary-role"

The OSI pipeline IAM role has the permissions required for Amazon MSK and Amazon OpenSearch Serverless, so the pipeline can consume message data from the source and write data to the destination. For true active-active replication, the sample deploys two Amazon MSK Replicators in each AWS Region. Each Amazon MSK cluster requires a cluster policy that allows Amazon MSK Replicator and OSI to connect. To validate the bidirectional replication, the solution uses AWS Lambda functions to produce test messages to both Amazon MSK clusters.

When an application generates an event, it first publishes the message to an Apache Kafka topic in the Regional streaming cluster powered by Amazon Managed Streaming for Apache Kafka. In this sample deployment, an AWS Lambda function simulates application activity by producing events into the topic. These events are durably stored in the Apache Kafka partitions, providing a reliable buffer between producers and downstream consumers. An ingestion pipeline built using Amazon OpenSearch Ingestion continuously reads the event stream from the Apache Kafka topic and prepares the data for indexing. The pipeline then indexes the processed events into a collection in Amazon OpenSearch Serverless, making the data searchable in near real time.

At the same time, Amazon MSK Replicator replicates the Apache Kafka topic to a peer Amazon MSK cluster in a secondary AWS Region while preserving the topic structure. This makes the same event stream available in the secondary AWS Region without requiring changes to downstream consumers. An OpenSearch Ingestion pipeline in the secondary AWS Region consumes the replicated topic and indexes the events into its local OpenSearch Serverless collection. As events continue to flow through the system, both AWS Regions maintain synchronized datasets that can be queried independently. This architecture enables low-latency Regional search while maintaining a resilient, cross-Region copy of the indexed data.

Failover scenario and considerations

You can fail over your application to the Amazon OpenSearch Serverless collection in the other AWS Region and continue operations without interruption. The data present before the impairment is available in both collections. Upon recovery, Amazon MSK Replicator and OSI pipelines automatically resume operations without manual intervention. Data that you write to the healthy AWS Region during the impairment is automatically backfilled to the recovered AWS Region. For detailed step-by-step guidance, see the disaster recovery section in the GitHub repo.

When using Amazon MSK Replicator, be aware that cross-Region data transfer incurs additional costs. To improve reliability, configure dead-letter queues (DLQs) for OSI pipelines to capture documents that fail ingestion. Additionally, monitor essential Amazon CloudWatch metrics, including ReplicationLatency for tracking lag between clusters, DocumentsFailed for identifying ingestion issues, and MessagesInPerSec for observing message throughput.

Persistent buffering in OSI provides a built-in safety net that prevents data loss when data producers send information faster than your OpenSearch cluster can process it, removing the need to provision and manage separate buffering infrastructure. By using managed storage across multiple Availability Zones, this feature enhances data durability while dynamically allocating OpenSearch Compute Units (OCUs) for both buffering and data processing, which incurs additional costs. Persistent buffering isn’t enabled by default. Without it, the OSI pipeline relies on an in-memory buffer, which is volatile and has limited capacity for storing incoming data before processing.

Conclusion

In this post, we showed you how to achieve cross-Region resiliency for Amazon OpenSearch Serverless and OpenSearch Service managed clusters. In our experiments, most writes of a few KB of data completed within one to a few seconds between the two chosen AWS Regions. Replication lag between the AWS Regions depends on the network delay between the chosen Regions and the settings configured on the Amazon OpenSearch Ingestion (OSI) pipeline.

Refer to AWS Service Level Agreements (SLAs) and Amazon OpenSearch Ingestion (OSI) for more details. You can also achieve active-passive replication for OpenSearch using OSI and Amazon Simple Storage Service (Amazon S3), as described in the post Achieve cross-Region resilience with Amazon OpenSearch Ingestion.


How to consolidate cross-Region S3 data into OpenSearch

David Venable — Fri, 08 May 2026 13:37:47 +0000

You might have data in Amazon Simple Storage Service (Amazon S3) buckets in different AWS Regions that you want available in a single Amazon OpenSearch Service domain or collection. Consolidating data across Regions provides unified analytics and search, reduces operational complexity, and streamlines your search infrastructure. We’re happy to announce that Amazon OpenSearch Ingestion pipelines can now read from S3 buckets in different Regions to ingest and consolidate data into a single OpenSearch Service domain or collection.

To consolidate this data across AWS Regions, you previously had to provide your own solution. Now Amazon OpenSearch Ingestion can help you accomplish this. In this post, I’ll show you how to use the new cross-Region support to ingest data from S3 buckets across multiple AWS Regions into a single OpenSearch Service domain or collection.

Amazon OpenSearch Ingestion (OSI) is a feature-rich data ingestion pipeline that you can use for many different purposes: observability, analytics, and zero-ETL search. Many customers use OpenSearch Ingestion to ingest data from Amazon S3 into OpenSearch Service domains and Amazon OpenSearch Serverless collections. Until now, you could only ingest from a single AWS Region at a time. Now that you can use OpenSearch Ingestion for cross-Region S3 ingestion, I’ll show you how you can use it in two scenarios: batch processing using S3 scan, and streaming ingestion using Amazon Simple Queue Service (Amazon SQS) queues for AWS vended logs like Amazon Virtual Private Cloud (Amazon VPC) Flow Logs and AWS CloudTrail.

Prerequisites

Complete the following prerequisite steps:

Deploy an OpenSearch Service domain or OpenSearch Serverless collection in the Region where you want to perform your search or analytics.
Have S3 buckets in at least two different Regions. You can use existing buckets or create new ones. One bucket can be in the same AWS Region as your OpenSearch Service domain or collection, or both can be in completely different Regions.
Upload objects with data into your S3 buckets. The data can be JSON, ND-JSON, Parquet, CSV, or plaintext formats.
Configure AWS Identity and Access Management (IAM) permissions needed for OSI. For instructions, see Amazon S3 as a source.
For cross-Region ingestion, you must now also include the s3:GetBucketLocation permission. This gives the pipeline the ability to determine which AWS Region the bucket is located in.
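
As a rough illustration of the permissions step, the following Python sketch builds a read policy for the pipeline role that includes s3:GetBucketLocation. The source only names s3:GetBucketLocation; the s3:ListBucket and s3:GetObject actions are an assumed minimum read set, and the `s3_read_policy` helper is hypothetical, so use the Amazon S3 as a source instructions as the authority.

```python
import json


def s3_read_policy(bucket_names):
    """Build a pipeline-role policy for reading from several buckets."""
    bucket_arns = [f"arn:aws:s3:::{name}" for name in bucket_names]
    object_arns = [f"{arn}/*" for arn in bucket_arns]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Bucket-level actions, including the Region lookup.
                "Effect": "Allow",
                "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
                "Resource": bucket_arns,
            },
            {
                # Object-level read access (assumed minimum set).
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": object_arns,
            },
        ],
    }


policy = s3_read_policy(["amzn-s3-demo-bucket1", "amzn-s3-demo-bucket2"])
print(json.dumps(policy, indent=2))
```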

After you complete these steps, you can set up your Amazon OpenSearch Ingestion pipelines for either batch or streaming scenarios. In the following sections, I’ll give you recommendations on when to choose each approach and outline the steps for creating your pipeline.

Batch scenarios

You can use the OpenSearch Ingestion S3 scan capability to read batch data from S3. You might find this approach useful when your data is written to S3 on a schedule. To perform a cross-Region S3 scan, you only specify the buckets that you’re reading from when you create the OpenSearch Ingestion pipeline.

The following diagram shows the design for an OpenSearch Ingestion pipeline in us-west-2 reading from S3 buckets in us-east-1 and eu-west-1 and writing that data into an OpenSearch Service domain in us-west-2.

Next, you will create an OpenSearch Ingestion pipeline. You must create this pipeline in the same Region as your OpenSearch Service domain or collection.

version: "2"
s3-scan-cross-region:
  source:
    s3:
      compression: automatic
      codec:
        json:
      scan:
        buckets:
          - bucket:
              name: amzn-s3-demo-bucket1
          - bucket:
              name: amzn-s3-demo-bucket2
      aws:
        region: us-west-2

  sink:
    - opensearch:
        hosts: [ "https://search-mydomain-abcdefghijklmn.us-west-2.es.amazonaws.com" ]
        index: s3_scan_cross_region
        aws:
          region: us-west-2

The previous pipeline configuration uses the JSON codec. You might want to configure a different codec if your data isn’t a large JSON object.

You can now query your OpenSearch Service domain or collection to see the data that you ingested.

Streaming scenarios: AWS vended logs

Like many of our customers, you might want to ingest S3 data from different AWS Regions into OpenSearch Service. A common reason is to consolidate AWS vended logs, such as VPC Flow Logs, CloudTrail data, and load balancer logs. For these scenarios, you can configure OpenSearch Ingestion pipelines to read from an Amazon SQS queue to stream data into your OpenSearch Service domain or collection.

AWS vended logs are written to Amazon S3 in the same AWS Region as the service that generates them. For example, VPC Flow Logs will be in the same AWS Region as your Amazon VPC. You can use OpenSearch Ingestion to consolidate these logs into one AWS Region. In the VPC Flow Logs example, you can consolidate your VPC Flow Logs from multiple AWS Regions into a single OpenSearch Service domain or collection to analyze network patterns from your different Amazon VPCs.

The following diagram outlines the overall setup. It shows an example of sending AWS vended logs from us-east-1 and eu-west-1 to an OpenSearch Service domain in us-west-2. You can change the AWS Regions depending on your specific needs.

1. Configure your vended logs to write log events to Amazon S3 buckets in their respective AWS Regions. Using VPC Flow Logs as our example, you can configure VPC Flow Logs for your VPCs.
2. Create an Amazon SQS queue in the same AWS Region as your OpenSearch Service domain.
3. Amazon S3 doesn’t send notifications to cross-Region Amazon SQS queues, so you will use intermediate Amazon Simple Notification Service (Amazon SNS) topics to consolidate the notifications from multiple Regions into one queue. For each S3 bucket, create an SNS topic.
4. Configure S3 Event Notifications for SNS. You will do this for each S3 bucket and each SNS topic.
5. SNS can send cross-Region notifications to SQS. Create a subscription from each SNS topic that you created in step 3 to the single SQS queue you created in step 2.
6. Configure your pipeline role to read from SQS and read from the relevant S3 buckets.
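
The notification fan-in described in these steps depends on two resource policies: each SNS topic must let Amazon S3 publish to it, and the SQS queue must let the SNS topics send messages to it. The following Python sketch builds both documents. The topic names are hypothetical, the queue ARN is derived from the example queue URL used in this post, and the condition keys are a common pattern rather than something this post prescribes.

```python
import json


def sns_topic_policy(topic_arn, bucket_arn, account_id):
    """Allow one S3 bucket to publish event notifications to its topic."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "s3.amazonaws.com"},
            "Action": "sns:Publish",
            "Resource": topic_arn,
            "Condition": {
                "ArnLike": {"aws:SourceArn": bucket_arn},
                "StringEquals": {"aws:SourceAccount": account_id},
            },
        }],
    }


def sqs_queue_policy(queue_arn, topic_arns):
    """Allow the per-Region SNS topics to deliver into the single queue."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "sns.amazonaws.com"},
            "Action": "sqs:SendMessage",
            "Resource": queue_arn,
            "Condition": {"ArnEquals": {"aws:SourceArn": topic_arns}},
        }],
    }


# Hypothetical topic names; one topic per source Region.
account = "123456789012"
topics = [
    f"arn:aws:sns:us-east-1:{account}:s3-events-us-east-1",
    f"arn:aws:sns:eu-west-1:{account}:s3-events-eu-west-1",
]
queue = f"arn:aws:sqs:us-west-2:{account}:amzn-s3-demo-all-regions"

topic_policy = sns_topic_policy(topics[0], "arn:aws:s3:::amzn-s3-demo-bucket1", account)
queue_policy = sqs_queue_policy(queue, topics)
print(json.dumps(queue_policy, indent=2))
```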

Now create an OpenSearch Ingestion pipeline in the same AWS Region as your OpenSearch Service domain.

version: "2"
s3-sqs-cross-region:
  source:
    s3:
      notification_type: sqs
      codec:
        newline:
      sqs:
        queue_url: https://sqs.us-west-2.amazonaws.com/123456789012/amzn-s3-demo-all-regions
      aws:
        region: us-west-2

  sink:
    - opensearch:
        hosts: [ "https://search-mydomain-abcdefghijklmn.us-west-2.es.amazonaws.com" ]
        index: s3_sqs_cross_region
        aws:
          region: us-west-2

The previous pipeline configuration uses the newline codec, which treats each line of an object as a separate event. You might want to configure a different codec if your data isn’t line-delimited.

Next, upload objects with data into your S3 buckets. When you upload data, S3 sends notifications to SNS, which forwards them to the SQS queue.

You can now query your OpenSearch Service domain or collection to see the data that you ingested.

Here is what makes this possible and what is different. The SQS queue receives the event notifications for the buckets. Before OpenSearch Ingestion supported cross-Region reads, the pipeline could see these events but couldn’t access the S3 bucket, even if the permissions were granted. Now, the pipeline determines the AWS Region that the bucket is in and obtains an AWS Security Token Service (AWS STS) token for that Region. Using an STS token from the same Region as the S3 bucket allows the pipeline to read the data.
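
To see what a consumer of the queue has to work with, the following Python sketch unwraps an SNS-delivered S3 event notification from an SQS message body and lists the bucket, object key, and Region of each record. This only shows the shape of the message; it is not how OpenSearch Ingestion is implemented, and the post states that the pipeline determines the bucket's Region itself. The sample event and the `s3_objects_from_sqs_body` helper are hypothetical.

```python
import json
from urllib.parse import unquote_plus


def s3_objects_from_sqs_body(body):
    """Extract bucket, key, and Region from an S3 event delivered via SNS."""
    envelope = json.loads(body)
    # SNS wraps the S3 event as a JSON string unless raw delivery is enabled.
    if envelope.get("Type") == "Notification":
        event = json.loads(envelope["Message"])
    else:
        event = envelope
    objects = []
    for record in event.get("Records", []):
        objects.append({
            "region": record["awsRegion"],
            "bucket": record["s3"]["bucket"]["name"],
            # Object keys are URL-encoded in S3 event notifications.
            "key": unquote_plus(record["s3"]["object"]["key"]),
        })
    return objects


# Hypothetical notification for an object written in eu-west-1.
s3_event = {
    "Records": [{
        "eventSource": "aws:s3",
        "awsRegion": "eu-west-1",
        "eventName": "ObjectCreated:Put",
        "s3": {
            "bucket": {"name": "amzn-s3-demo-bucket2"},
            "object": {"key": "AWSLogs/flow+logs/part-0001.log.gz"},
        },
    }]
}
body = json.dumps({"Type": "Notification", "Message": json.dumps(s3_event)})

objects = s3_objects_from_sqs_body(body)
print(objects)
```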

Using the AWS Console

When you create the pipeline using the OpenSearch Ingestion console, you will have options to select a blueprint for your use case. With these blueprints, you can create pipelines for various vended log types just by selecting your SQS queue and OpenSearch domain. The blueprint handles the data type mappings for you by including appropriate processors. You can use these blueprints as a starting point and modify your processors for your specific requirements.

Clean up resources

When you’re done testing, use the following steps to delete the resources that you created.

If you set up a batch pipeline:

Delete the OpenSearch Ingestion pipeline.

If you set up a streaming pipeline:

Delete the OpenSearch Ingestion pipeline.
If you created an SQS queue, delete the SQS queue.
If you created SNS topics, delete the SNS topics.
If you configured AWS vended logs, you can delete those logging configurations. This example used VPC Flow Logs. For instructions, see Delete the Flow Logs.

For both pipelines, these steps help you delete the common resources.

Delete the IAM roles that you created for your pipeline.
Delete the S3 objects that you uploaded and the S3 bucket.
Delete the Amazon OpenSearch domain or the Amazon OpenSearch Serverless collection.

Conclusion

In this post, I showed you how you can use Amazon OpenSearch Ingestion to ingest data from Amazon S3 buckets in different AWS Regions. I showed that this works for both batch scan and streaming scenarios. The feature offers you a straightforward way to consolidate your data from other Regions into one OpenSearch Service domain or collection.

To get started with the cross-Region S3 source, refer to the OpenSearch Ingestion documentation or try creating a pipeline from one of our blueprints using the OpenSearch Ingestion console. You can read about the codecs that OpenSearch Ingestion offers for parsing your S3 objects. You can also learn about the various processors that OpenSearch Ingestion offers, so you can transform and enrich your data to meet your needs.

You can also use OpenSearch Ingestion for ingestion that is both cross-Region and cross-account. To do this, you must grant cross-account permissions on your S3 bucket. You must also make some changes to your pipeline configuration. Combining what I showed you in this post with the existing cross-account features greatly expands your ingestion options.

If you’re ready to take your streaming ingestion analytics to the next level, you can read about how to generate metrics from logs and even how to send those derived metrics to Amazon Managed Service for Prometheus.

Have you tried out the cross-Region capabilities of OpenSearch Ingestion? Share your use cases and questions in the comments.


Enable real-time mainframe analytics with Precisely Connect and Amazon S3

Supreet Padhi, Rochelle Grubbs — Fri, 08 May 2026 13:29:29 +0000

This is a guest post by Supreet Padhi, Technology Architect, Strategic Technologies, and Rochelle Grubbs, Senior Director, Solution Architect at Precisely in partnership with AWS.

Business leaders face a critical challenge to enable real-time analytics. Their most valuable data sits in mainframe systems that reliably process billions of transactions daily, but extracting value for modern analytics and AI remains complex and costly. Traditional mainframe-to-cloud integration approaches require multi-step replication with intermediary systems, creating operational overhead, latency, and data integrity risks. This complexity delays insights, increases infrastructure costs, limits agility, and blocks organizations from using AI and machine learning on their mainframe data.

Precisely, a global leader in data integrity with over 12,000 customers including 95 of the Fortune 100, has announced an expansion of its collaboration with AWS through new enhancements to Precisely Connect. Precisely is an AWS Data and Analytics ISV Competency and AWS Migration and Modernization ISV Competency partner. Precisely has service specializations in Amazon Redshift and Amazon Relational Database Service (Amazon RDS).

In Stream mainframe data to AWS in near-real time with Precisely and Amazon MSK, we showed you how to set up mainframe CDC and the AWS Mainframe Modernization – Data Replication for IBM z/OS Amazon Machine Image (AMI) available in AWS Marketplace. In this post, we discuss how you can use Precisely Connect to enable real-time, direct replication of mainframe data to Amazon Simple Storage Service (Amazon S3), and how your organization can extend this foundation using Amazon S3 Tables for advanced analytics.

Real-time mainframe data access

Organizations that can connect their mainframe environments with modern cloud platforms can gain advantages through improved agility, reduced operational costs, and enhanced analytics capabilities. For example, moving appropriate analytics and reporting workloads to the cloud can significantly reduce mainframe operational costs while maintaining performance and reliability. Real-time data access makes insights available within seconds rather than waiting for batch processing cycles, enabling faster responses to market changes and customer needs. Eliminating bulk data extracts and intermediary systems also reduces infrastructure and maintenance expenses. This frees IT resources to focus on higher-value initiatives.

However, implementing mainframe-to-cloud integrations presents unique technical challenges that require specialized solutions. These include converting mainframe character encoding (EBCDIC) to standard ASCII format and handling mainframe-specific data types such as packed decimal (COMP-3) fields. You also need to manage the complexity of VSAM (Virtual Storage Access Method) files that can store multiple record types in a single file, and maintain real-time synchronization without impacting mainframe performance.
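
To give a feel for the first two challenges, the following Python sketch decodes an EBCDIC text field and a packed decimal field using only the standard library. It is an illustration of the data formats, not a description of how Precisely Connect performs the conversion. The cp037 code page is an assumption (the right code page depends on the mainframe's locale), and both helper functions are hypothetical.

```python
from decimal import Decimal


def decode_ebcdic(raw, codepage="cp037"):
    """Decode a fixed-width EBCDIC text field and drop trailing padding."""
    # cp037 (US/Canada EBCDIC) is an assumption; other systems use other pages.
    return raw.decode(codepage).rstrip()


def unpack_comp3(raw, scale=0):
    """Decode packed decimal: two digits per byte, sign in the last nibble."""
    digits = []
    for byte in raw[:-1]:
        digits.append(byte >> 4)
        digits.append(byte & 0x0F)
    digits.append(raw[-1] >> 4)
    sign_nibble = raw[-1] & 0x0F
    if any(d > 9 for d in digits):
        raise ValueError("invalid digit nibble in packed decimal field")
    value = int("".join(str(d) for d in digits))
    if sign_nibble == 0x0D:  # 0xD marks a negative value
        value = -value
    # Shift the implied decimal point 'scale' places to the left.
    return Decimal(value).scaleb(-scale)


name = decode_ebcdic(b"\xC8\xC5\xD3\xD3\xD6\x40\x40")
amount = unpack_comp3(b"\x12\x34\x5C", scale=2)
refund = unpack_comp3(b"\x12\x34\x5D", scale=2)
print(name, amount, refund)
```

In a COBOL copybook, the number of implied decimal places comes from the field's picture clause, which is why `scale` is passed in separately here rather than read from the bytes.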

Change Data Capture (CDC) technology addresses these challenges through incremental data movement that eliminates disruptive bulk extracts by streaming only changed data to cloud targets, minimizing system impact and ensuring data currency. Real-time synchronization keeps cloud applications in sync with mainframe systems, enabling immediate insights and responsive operations.
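
The idea of moving only changed data can be sketched in a few lines of Python: instead of reloading a full extract, the target applies a stream of insert, update, and delete events in order. The event format below is hypothetical and far simpler than what a CDC product emits; it is meant only to show why incremental changes are enough to keep a target in sync.

```python
def apply_changes(target, changes):
    """Apply ordered change events to a target keyed by primary key."""
    for change in changes:
        op, key = change["op"], change["key"]
        if op in ("insert", "update"):
            # Merge changed columns into the existing row, if there is one.
            target[key] = {**target.get(key, {}), **change["after"]}
        elif op == "delete":
            target.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    return target


# Hypothetical change stream for an accounts table.
changes = [
    {"op": "insert", "key": 1, "after": {"name": "A", "balance": 100}},
    {"op": "insert", "key": 2, "after": {"name": "B", "balance": 50}},
    {"op": "update", "key": 1, "after": {"balance": 80}},
    {"op": "delete", "key": 2},
]

state = apply_changes({}, changes)
print(state)
```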

Precisely Connect: Real-time data replication to Amazon S3

With Precisely Connect, you can replicate data directly from mainframes to Amazon S3 in real time, eliminating the need for intermediaries and simplifying modernization. Data flows directly from mainframe sources, including Db2 z/OS, IMS, and VSAM, to Amazon S3, eliminating intermediary steps and reducing both latency and operational complexity. You can move mainframe data directly to Amazon S3 data lakes and analytics platforms without managing complex, multi-step replication processes.

The simplicity of this approach reduces maintenance overhead and integration complexity by removing the need for staging servers, middleware, or batch processing systems. After data lands in Amazon S3, it becomes immediately available for downstream AWS workloads. You can use Amazon Athena for SQL queries, AWS Glue for ETL and data cataloging, Amazon EMR for big data processing, Amazon SageMaker AI for machine learning, and Amazon Quick Sight for business intelligence dashboards.

Solution overview

Here we present a solution architecture for streaming mainframe data changes from Db2z through AWS Mainframe Modernization – Data Replication for IBM z/OS AMI directly to Amazon S3 and then using Amazon S3 Tables for advanced analytics capabilities.

By introducing direct S3 replication and streamlining deployment through the pre-configured AWS Marketplace AMI, you can deploy in minutes rather than weeks. This creates new possibilities for data distribution, transformation, and consumption. This architecture offers several key benefits:

Simplified deployment – Accelerate implementation using the preconfigured AWS Marketplace AMI
Direct replication – Eliminate intermediary systems by streaming data directly to Amazon S3, reducing latency and operational overhead
Real-time synchronization – Capture changes as they occur on the mainframe, ensuring downstream applications operate on current data
Flexible analytics options – Use S3 Tables for Iceberg-compatible tabular data storage
Comprehensive AWS integration – Gain immediate access to Amazon EMR, Amazon Athena, AWS Glue, Amazon SageMaker AI, and Amazon Quick Sight
Natural language data access – Through the MCP Server for Amazon S3 Tables, AI assistants can interact with structured data using conversational interfaces without needing to write SQL queries.

Prerequisites

To complete the solution, you need the following prerequisites:

Precisely components

AWS Mainframe Modernization – Data Replication for IBM z/OS – Deploy this Precisely Connect AMI from AWS Marketplace. This pre-configured image contains the Apply Engine and Controller Daemon components required for replicating mainframe data changes to Amazon S3.
Precisely Connect CDC Capture/Publisher – Deploy the Precisely Connect CDC Capture/Publisher on your mainframe environment. This component captures changes from Db2z logs and streams them to the Apply Engine over TCP/IP.

For detailed setup and configuration steps for Precisely components, refer to our previous post Stream mainframe data to AWS in near-real time with Precisely and Amazon MSK.

Connectivity requirements

Have network connectivity established between your mainframe environment and AWS using your organization’s approved connectivity method (such as AWS Direct Connect or VPN).
Verify that firewall rules allow TCP/IP communication between the mainframe Capture/Publisher and the Apply Engine.

AWS analytics components (optional extension)

After mainframe data lands in Amazon S3, your organization can extend its analytics capabilities using AWS services. One approach is to use Amazon EMR streaming jobs to process and write data to Amazon S3 Tables. After the data is stored in S3 Tables, the data can be queried directly using Amazon Athena for ad-hoc SQL analysis. This extension is optional and represents one of several ways to consume and analyze mainframe data after it reaches Amazon S3.

The following diagram illustrates the solution architecture.

Capture/Publisher – Connect CDC Capture/Publisher captures Db2 changes from Db2 logs using IFI 306 Read and communicates captured data changes to a target engine through TCP/IP.
Controller Daemon – The Controller Daemon authenticates all connection requests, managing secure communication between the source and target environments.
Apply Engine – The Apply Engine receives the changes from the Publisher agent and applies the changed data to the target Amazon S3.
Amazon S3 – Serves as the scalable data lake foundation where replicated mainframe data lands.
Amazon EMR streaming job – As data arrives, an instance of the Amazon EMR streaming job writes the data to target tables in Amazon S3 Tables.
Amazon Athena – Queries data stored in Amazon S3 Tables using standard SQL.

This architecture provides a clean separation between the data capture process and the data consumption process, allowing each to scale independently. When CDC data arrives in Amazon S3, you can use Amazon S3 Tables to store Db2 z/OS, VSAM, and IMS data in an open table format (Apache Iceberg) that is ready for analytics, providing a flexible path to mainframe modernization.

Quantifiable business value

Organizations implementing this solution typically see significant reductions in mainframe operational costs by offloading analytics and reporting workloads to the cloud. The elimination of intermediary infrastructure reduces both capital and operational expenses. The reduced maintenance burden frees IT resources to focus on strategic initiatives rather than managing complex replication systems.

Speed and agility improvements are equally significant. Near real-time data availability, measured in seconds to minutes rather than hours to days, enables organizations to respond rapidly to market changes and operational events. The rapid deployment of new analytics use cases without requiring mainframe changes accelerates innovation. Organizations gain access to the full breadth of AWS services that can be used immediately after data lands in Amazon S3.

From an analytics and AI perspective, the solution creates a unified data platform that brings together mainframe, cloud-native, and third-party data sources. This unified view enables advanced machine learning on historical and current data, delivering predictive insights that drive proactive decision-making across the organization.

Customer story

A leading global payments provider put this approach into practice. The provider was struggling to generate timely analytics and insights from point-of-sale (POS) transaction data. As one of the world’s largest payment providers, it processes hundreds of thousands of transactions per second, and users expect a card swipe to be approved in seconds. A new architecture was needed to keep up with customer demand and transaction volume. By streaming mission-critical mainframe data directly to AWS in real time using Precisely Connect and landing it in Amazon S3 Tables, the company used storage built on the Apache Iceberg open standard. This approach enables high-performance analytics directly on mainframe data alongside cloud-native sources.

Conclusion

In this post, we demonstrated how Precisely Connect enables real-time, direct data replication from mainframes to Amazon S3, eliminating intermediaries and simplifying mainframe modernization.

Your organization can further extend this foundation with Amazon S3 Tables, purpose-built storage for Apache Iceberg tables in S3, enabling analytical applications to query the most current mainframe data using tools such as Amazon Athena, Amazon EMR, and Amazon Redshift.

Get started by deploying AWS Mainframe Modernization – Data Replication for IBM z/OS from AWS Marketplace and use Amazon S3 as a target for your mainframe use cases. Learn more about Precisely’s mainframe data integration capabilities at precisely.com. Contact AWS and Precisely experts to discuss your specific modernization challenges and design a proof-of-concept that demonstrates business value quickly.


Build streaming applications on Amazon Managed Service for Apache Flink with AI-assisted guidance

Mazrim Mehrtens — Wed, 06 May 2026 15:45:57 +0000

Building production-ready Apache Flink applications requires learning a complex ecosystem. The learning curve is steep for newcomers, and even experienced Flink developers encounter complexity when scaling applications or troubleshooting production issues. With the new Kiro Power and Agent Skill for Amazon Managed Service for Apache Flink, you can get AI-assisted guidance for building, improving, and migrating streaming applications directly in your development environment, with recommendations that are grounded in best practices.

The Managed Service for Apache Flink Kiro Power and Agent Skill helps you navigate challenges across the Flink application lifecycle. For new development, the tool provides contextual guidance on application architecture, state management patterns, and connector selection. For existing application improvements, it analyzes your existing code to identify performance bottlenecks, reliability risks, and opportunities for improvement. If you’re upgrading from Apache Flink 1.x to 2.x, it detects compatibility issues and provides targeted refactoring steps to modernize your applications.

In this post, we walk through installing the Power and Skill, building a streaming pipeline that reads from one Kinesis data stream and writes to another, and migrating an existing application to Flink 2.2. You can follow along with this use case to see how the Managed Service for Apache Flink Kiro Power can help you build a resilient, performant application grounded in best practices.

Solution overview

The Managed Service for Apache Flink Power/Skill works across multiple AI development tools, providing the same comprehensive guidance in each:

Kiro: Installs as a Power that automatically activates for Flink-related development activities
Cursor and Claude Code: Installs as an Agent Skill following the open Agent Skills standard
Other compatible agents: Compatible with tools supporting the Agent Skills specification

The Power/Skill provides guidance across the development lifecycle:

Best practices for Managed Service for Apache Flink application development
Maven dependency management and project structure
Resource improvements including KPU sizing, parallelism tuning, and checkpointing
Job graph architecture patterns and anti-patterns
Amazon CloudWatch monitoring and logging configuration
Flink 1.x to 2.2 migration guidance with state compatibility assessment
Connector-specific guidelines

The content is maintained in a single repository with use case specific entry points that are dynamically loaded depending on your needs.

Prerequisites

To use the tool, you need:

A development machine running macOS, Linux, or Windows with Java 11 or later (Java 17 for Flink 2.2) and Apache Maven installed
One of the following AI development tools:
- Kiro IDE
- Cursor
- Claude Code
- Other Agent Skills-compatible tools
Basic knowledge of Java and stream processing concepts (helpful but not required)
An AWS Identity and Access Management (IAM) role configured with access to create and run Managed Service for Apache Flink applications, create Amazon Simple Storage Service (Amazon S3) buckets for Flink application dependencies, create Kinesis Data Streams for streaming, and create IAM roles (required if deploying an application)

Installation

Installing as a Kiro Power

Open Kiro IDE.
Open Amazon Managed Service for Apache Flink and select Open in Kiro.

Choose Install to install the power.

Verify that the power is listed in the installed powers in the Kiro IDE.

The Power is now installed and automatically activates when you work on Flink-related development activities.

Installing as an Agent Skill

Agent Skills are discovered automatically by compatible tools through the SKILL.md file. Installation varies by tool:

Per-project installation (available in one project):

# For Cursor
git clone https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files.git .cursor/skills/flink

# For Claude Code
git clone https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files.git .claude/skills/flink

# For other Agent Skills-compatible tools
git clone https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files.git .agents/skills/flink

Personal installation (available across projects):

# For Cursor
git clone https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files.git ~/.cursor/skills/flink

# For Claude Code
git clone https://github.com/awslabs/managed-service-for-apache-flink-agent-steering-files.git ~/.claude/skills/flink

To verify the installation, interact with the skill in your preferred tool. In Claude Code, you can invoke it with /flink. In Cursor, type / in Agent chat and search for flink. For more information about Agent Skills, see the Agent Skills documentation.
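Because compatible tools discover the skill through its SKILL.md file, a quick file check can confirm that a clone landed where the tool expects it. The following is a minimal sketch, not part of the repository: the function name check_skill_dir is hypothetical, and it assumes that SKILL.md sits at the top level of the cloned directory.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a skill directory contains SKILL.md.
# Assumption: SKILL.md sits at the top level of the cloned directory.
check_skill_dir() {
  local dir="$1"
  if [ -f "$dir/SKILL.md" ]; then
    echo "found: $dir/SKILL.md"
  else
    echo "missing: $dir/SKILL.md"
    return 1
  fi
}

# Check the per-project and personal locations used in the commands above.
for dir in .cursor/skills/flink .claude/skills/flink .agents/skills/flink \
           "$HOME/.cursor/skills/flink" "$HOME/.claude/skills/flink"; do
  check_skill_dir "$dir" || true
done
```

A "missing" line for a location you did not install to is expected; only the location for your tool needs to report "found".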

Example: Building a Kinesis-to-Kinesis streaming pipeline

Rather than listing best practices, the Power/Skill actively guides you through making the right architectural decisions at each stage of development.

The following walkthrough demonstrates building a Flink application that reads from Amazon Kinesis Data Streams, analyzes events, and writes to another Kinesis stream. To follow along, run the same prompts in your Kiro IDE or other development tool. In the following prompts, we focus on local development and don’t create AWS resources. However, if you prompt the agent to create and deploy AWS resources, they will incur additional costs.

Starting the conversation

In the Kiro IDE, we can open a new chat in Vibe mode and prompt: “Help me build a Flink application that reads from Kinesis, processes events with windowed aggregations, and writes results to another Kinesis stream.”

What happens next

The AI assistant loads relevant guidance and walks you through the development process:

1. Confirm project requirements and details

Kiro automatically loads the Power based on the context of your prompt. The assistant then asks you questions about your use case to make sure that it builds the right application for your needs.

For the demo, we can prompt for a financial services use case: “I’m in financial services, so let’s use that as the use case. Try calculating volatility in real-time. And let’s use Flink 1.20 for now.”

Kiro then confirms its assumptions and asks to proceed.

2. Project setup

After we confirm, Kiro generates a project with Flink 1.20 dependencies, Kinesis connectors, and proper scope configuration for Managed Service for Apache Flink deployment. The assistant creates the application structure with proper configuration separation between local development and Managed Service for Apache Flink service-level settings. It then creates a Kinesis source with proper deserialization, a sink with a partitioning strategy, and windowed aggregation logic with proper state management, TTL configuration, and error handling.

Kiro also compiles the code to verify that it builds correctly. We can then proceed by asking Kiro to help us with running the application locally for testing.

3. Testing the project locally

You can run the application locally to test the results. We can prompt: “Can we run this locally using something like LocalStack to test deploying the job and also see some example results?”

Kiro creates the necessary Docker resources, testing scripts, and deployment steps to run the application locally with synthetic resources. If it encounters bugs or detects issues during the local testing process, it fixes them so that your deployment runs smoothly.

We can also access our local Flink UI to view our application.

4. Deploying the application to Managed Service for Apache Flink

Now that our application is running and generating results end-to-end, we can use the Power for other tasks. For example, you can get guidance on KPU allocation and parallelism settings based on your expected throughput, configure monitoring with CloudWatch metrics, logging, and dashboards for operational visibility, or set up infrastructure as code (IaC) for deploying in Managed Service for Apache Flink. We can prompt: “This is great! Can you help me deploy this application to Managed Service for Apache Flink? I’d like to use CloudFormation for deployment.”

Using the generated AWS CloudFormation templates and deployment scripts, we can deploy our application to AWS with associated resources for Kinesis Data Streams, Amazon S3 buckets for application JAR files, CloudWatch log groups, and IAM roles. Deploying these resources requires IAM credentials with associated permissions and will incur cost for the associated resource usage.

In a traditional workflow, you build your application, deploy to Managed Service for Apache Flink, then discover performance issues or configuration problems in production. You spend time debugging checkpoint failures, serialization errors, or resource bottlenecks.

With the Power/Skill, the AI assistant catches these issues during development. When you need complex aggregation and processing logic, it helps you to do so in a way that uses resources efficiently with Flink’s scaling model. When you create an application bug that would cause a crash in production, it helps you identify it early with local end-to-end testing. The Power is configured with guidance and best practices to help with the development process from start to finish.

Example: Migrating to Flink 2.2

The Managed Service for Apache Flink Kiro Power and Agent Skill provide contextual advice specific to your situation. For new developers, it walks through the complete workflow from project setup to deployment, explaining Managed Service for Apache Flink-specific concepts along the way. For migration projects, it analyzes your existing code for Flink 2.2 compatibility issues and provides targeted refactoring guidance. The following example shows how the tool helps with the complex task of migrating from Flink 1.x to 2.2.

1. Assessing migration compatibility

We can ask Kiro to help us upgrade our project from the previous example to Flink 2.2: “I need to migrate my Flink 1.x application to 2.2. Can you help me identify compatibility issues?”

The assistant loads the Managed Service for Apache Flink Kiro Power and analyzes our code to identify potential issues.

In this case, using our generated project on Flink 1.20, Kiro identified the following compatibility issues for the upgrade:

Java 11 must move to Java 17 (minimum for Flink 2.2)
Flink version 1.20.3 must update to 2.2.0
The Kinesis connector must update from 5.1.0-1.20 to 6.0.0-2.0
Time references must change to java.time.Duration in window and lateness calls
The LocalStreamEnvironment instanceof check must be removed (class removed in 2.2)
The isEndOfStream() override must be dropped from PriceTickDeserializer (method removed)
implements Serializable must be added to PriceTick and VolatilityResult

It also verified that some parts of the project are already Flink 2.2 compatible. The project uses the new Source and Sink V2 APIs, the logging is 2.2 ready, the POJOs with no collection fields are state migration safe, and there are no Kryo registrations or TimeCharacteristic usage.
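Several of the findings above correspond to identifiers that a plain text search can locate before you involve the assistant. The following is a rough sketch under stated assumptions: the function name scan_flink_migration is hypothetical, the search patterns are taken from the list above (the windowing Time class is the Flink 1.x type that java.time.Duration replaces), and a match is only a hint to review the line, not proof of an incompatibility.

```shell
#!/usr/bin/env bash
# Hypothetical helper: list Java source lines that mention APIs flagged in the
# compatibility assessment above. It does not replace the assistant's analysis.
scan_flink_migration() {
  local src_dir="$1"
  local pattern='LocalStreamEnvironment|isEndOfStream|windowing\.time\.Time'
  local matches
  if [ ! -d "$src_dir" ]; then
    echo "not a directory: $src_dir" >&2
    return 2
  fi
  # /dev/null forces grep to print file names even for a single source file.
  matches="$(find "$src_dir" -type f -name '*.java' \
    -exec grep -nE "$pattern" /dev/null {} + 2>/dev/null || true)"
  if [ -n "$matches" ]; then
    printf '%s\n' "$matches"
  else
    echo "no flagged references found in $src_dir"
  fi
}

# Example (path is an assumption about a standard Maven layout):
#   scan_flink_migration src/main/java
```

Items such as the Java version, dependency versions, and Serializable markers are not covered by this search and still need the build file and class review described above.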

2. Implementing the migration

We can then ask Kiro to provide a step-by-step migration plan, both for updating the code and deploying to Managed Service for Apache Flink: “Can you help me update the application for Flink 2.2, and help me figure out the steps to upgrade my running Managed Service for Apache Flink application?”

Kiro evaluates the entire application code base against the Power’s migration guidance and best practices, and provides a comprehensive analysis of the breaking changes, risks, and potential issues that would arise in the upgrade. After we approve the changes, Kiro makes the necessary updates to make our application compatible with Flink 2.2 and provides a step-by-step upgrade process for the running application.

Now that Kiro has prepared the application for Flink 2.2, highlighted migration risks, and provided us with a clear path to execute the upgrade, you can test the upgrade process with confidence. From here, we can proceed to run our Flink 2.2 application locally, test the upgrade process in a development environment in Managed Service for Apache Flink, and then execute the upgrade in our production environment. If we run into issues, we can return to the Kiro Power to get advice, resolve issues, and unblock our upgrade.

Cleanup

To remove the Power/Skill installation:

For Kiro:

Open Kiro IDE.
Navigate to the Powers tab.
Uninstall the Amazon Managed Service for Apache Flink Power.

For Agent Skills:

# Remove per-project installation
rm -rf .cursor/skills/flink  # or .claude/skills/flink

# Remove personal installation
rm -rf ~/.cursor/skills/flink  # or ~/.claude/skills/flink

If you created Managed Service for Apache Flink applications or associated resources during development, clean up those resources:

Delete the Managed Service for Apache Flink application from the AWS Management Console.
Remove associated resources for sources and sinks, if created for development.
Delete CloudWatch log groups if no longer needed.

Conclusion

In this post, we showed you how the Kiro Power and Agent Skill for Amazon Managed Service for Apache Flink brings AI-assisted development to stream processing. You can use the tool to overcome Flink’s learning curve, build applications following Managed Service for Apache Flink best practices, and migrate to Flink 2.2 with confidence. To get started, choose the path that fits your workflow:

If you use Kiro, install the Power from the Powers tab and start a new chat with a Flink-related prompt.
If you use Cursor, Claude Code, or another Agent Skills-compatible tool, clone the GitHub repository into your skills directory and reference the steering/ files for guidance.
If you are new to Amazon Managed Service for Apache Flink, review the Amazon Managed Service for Apache Flink Developer Guide and the Apache Flink documentation to build foundational knowledge alongside the Power/Skill.

We welcome your feedback. Report issues or request features through GitHub Issues, or contribute improvements via pull requests.


Migrating TLS Clients managed by third-party Certificate Authorities from self-managed Apache Kafka to Amazon MSK

Ali Alemi — Wed, 06 May 2026 15:41:21 +0000

Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed streaming data service that handles Apache Kafka infrastructure and operations, so developers and DevOps managers can run Apache Kafka applications on AWS. Migrating to Amazon MSK requires no application code changes because Amazon MSK uses fully open source Apache Kafka, allowing existing applications and tools to work seamlessly. Amazon MSK with Express brokers streamlines Kafka management by providing up to 3x more throughput, 20x faster scaling, and 180x faster recovery with virtually unlimited storage, delivering resiliency and elasticity for mission-critical workloads.

Amazon MSK supports multiple authentication methods to secure client connections to Kafka clusters. These methods include:

IAM authentication for identity-based access control using AWS Identity and Access Management (IAM) policies.
Mutual TLS (mTLS) authentication where both clients and brokers authenticate each other using X.509 certificates.
SASL/SCRAM authentication for username and password-based authentication stored in AWS Secrets Manager.

When customers manage their own Kafka clusters and adopt mTLS, they typically rely on a third-party managed certificate authority (CA) to sign and verify both client and server certificates. This establishes a trust relationship where the CA acts as the trusted intermediary that validates the identity of both parties in the communication. When customers migrate their workloads to Amazon MSK, they must make sure that client certificates are signed by a CA that’s recognized and trusted by the MSK cluster. Amazon MSK recommends that customers use AWS Private Certificate Authority to create a private CA within AWS that MSK trusts. The migration path typically requires customers to either:

Generate new client certificates signed by an AWS Private CA that Amazon MSK recognizes, or
Establish a certificate chain where their existing third-party CA is subordinate to or trusted by an AWS-managed CA

In this post, we provide an approach to reuse your existing client certificates without reissuing them through AWS Certificate Manager (ACM) Private Certificate Authority. This solution enables an accelerated migration path by using your current third-party CA infrastructure. This removes the complexity and operational overhead of certificate re-issuance while maintaining the security posture that you’ve established with your existing mTLS implementation.

Solution overview

This approach involves four key steps to reuse your existing client certificates when migrating to Amazon MSK:

1. Create an Intermediate Certificate Using Your Third-Party CA

First, you generate an intermediate certificate authority (CA) certificate using your existing third-party CA infrastructure. This intermediate certificate acts as a bridge between your current certificate management system and AWS.

2. Import the Intermediate Certificate into AWS Certificate Manager as a Private CA

Next, you import this intermediate certificate into AWS Certificate Manager (ACM) as a Private Certificate Authority (PCA). This step establishes the intermediate CA within the AWS environment, making it recognizable to AWS services.

3. Integrate Amazon MSK with the PCA created from your Intermediate Certificate

You then configure your Amazon MSK cluster to use the ACM Private CA that contains your imported intermediate certificate. This integration enables Amazon MSK to recognize and trust certificates signed by your certificate authority.

4. Establish trust through common Certificate Authority

This approach works because both the AWS Private CA and your existing client certificates share the same root of trust: both are signed by your third-party CA. When Amazon MSK validates client certificates, it can trace the certificate chain back through the intermediate certificate in AWS Private CA to your trusted third-party CA, establishing a complete chain of trust without requiring certificate reissuance.

This solution maintains your existing security architecture while enabling seamless migration to Amazon MSK, so your clients can continue using their current certificates without interruption.

Figure 1: Architecture diagram showing the integration of third-party Certificate Authority with Amazon MSK through AWS Certificate Manager Private CA
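The shared root of trust can be illustrated locally with OpenSSL. The following is a minimal sketch with throwaway keys, not the repository's scripts and not a description of how Amazon MSK validates certificates internally: one self-signed root stands in for the third-party CA and signs both a subordinate CA certificate (the role AWS Private CA plays) and a client certificate. The function name, file names, and subject names are all hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: one root CA signs a subordinate CA certificate and a client
# certificate, then both are verified against that same root.
demo_shared_root() (
  # Runs in a subshell so the caller's working directory is untouched.
  cd "$(mktemp -d)" || exit 1
  quiet() { "$@" >/dev/null 2>&1 || exit 1; }

  # Stand-in for the third-party root CA (self-signed).
  quiet openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
    -keyout root.key -out root.pem -subj "/CN=Example Root CA"

  # Stand-in for the subordinate CA: a CSR signed by the root with CA:TRUE.
  printf 'basicConstraints=critical,CA:TRUE\n' > sub.ext
  quiet openssl req -newkey rsa:2048 -nodes \
    -keyout sub.key -out sub.csr -subj "/CN=Example Subordinate CA"
  quiet openssl x509 -req -in sub.csr -CA root.pem -CAkey root.key \
    -CAcreateserial -extfile sub.ext -days 1 -out sub.pem

  # Stand-in for an existing client certificate issued by the same root.
  quiet openssl req -newkey rsa:2048 -nodes \
    -keyout client.key -out client.csr -subj "/CN=kafka-admin"
  quiet openssl x509 -req -in client.csr -CA root.pem -CAkey root.key \
    -CAcreateserial -days 1 -out client.pem

  # Both certificates chain to the same root of trust.
  openssl verify -CAfile root.pem sub.pem client.pem
)

demo_shared_root
```

If both certificates validate, openssl verify reports each file name followed by OK. With a real third-party CA you would skip the key generation and run only the final verification against your own certificate files.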

Implementation steps

In real-world scenarios, you already have a certificate authority that has issued certificates for your clients. For the purpose of this post, we use a code sample to create a self-signed certificate authority (using OpenSSL) to demonstrate the implementation steps. If you already have an existing certificate authority, you don’t need to create a root CA. You can generate an intermediate CA (Step 2) using your third-party CA and continue following the steps from where you import the intermediate CA certificate into AWS ACM as a Private Certificate Authority.

Step 1: Create a root Certificate Authority using OpenSSL

Cloning the repository

To clone the repository, complete the following steps:

Clone the repository using the following command:

git clone https://github.com/aws-samples/msk-third-party-mtls

Change to the openssl directory inside the cloned repository:

cd ./msk-third-party-mtls/openssl

Make the scripts executable, and then run the setup script:

chmod +x *.sh
./setup-ca.sh

You will be prompted to set a password for the private key and the certificate.

Step 2: Create an intermediate CA for AWS ACM

In the AWS Private CA console, create a subordinate CA.

Enter distinguished name information matching your organization, select a key algorithm, and choose Create CA.
From the Actions menu, select Install CA certificate.
Download the Certificate Signing Request (CSR) file provided by AWS Private CA.

Save the CSR file in your local certs directory as CSR.pem.

Sign the CSR issued by AWS Private CA with your root CA using the ./sign-acm-ca.sh script provided in the code example.

Note: AWS Private CA retains the private key internally. You only sign its CSR and import the resulting certificate back into AWS Private CA.

Step 3: Import signed certificate to AWS ACM Private CA

Go back to the AWS Private CA console.
Select the CA that you created and select Install CA certificate.

Select External private CA as CA type.

Importing the certificate into AWS Certificate Manager

Open both files in a text editor:

acm-subordinate-ca-cert.pem
acm-ca-chain.pem

In the Certificate body field, paste the entire content of the acm-subordinate-ca-cert.pem file.
In the Certificate chain field, paste the content of the acm-ca-chain.pem file. This file contains one certificate (the root CA certificate).

Important: The certificate chain shouldn’t include the subordinate CA certificate itself—only the certificates above it in the chain (the root CA).

Choose Confirm and install to complete the process.

The AWS Private CA should now show an Active status.

Step 4: Configure your MSK cluster for Mutual TLS authentication

Select your MSK cluster, go to Properties and edit the Security settings.
Select TLS client authentication through AWS Certificate Manager (ACM) as the access control method and choose the Subordinate CA that you created earlier. Then choose Save changes.

Step 5: Test your client

Run the certificate generation script

Run the following command, replacing <client-name> with a descriptive name for your client (this name is used in the certificate file name):

./generate-client-cert.sh <client-name>

Example:

./generate-client-cert.sh kafka-admin

Enter distinguished name information

When prompted, enter the distinguished name (DN) options. These should match your root CA settings except for the Common Name (CN):

Country (C): Match your root CA (for example, US)
State (ST): Match your root CA (for example, State)
Organization (O): Match your root CA (for example, Anycompany)
Organizational Unit (OU): Match your root CA (for example, IT)
Common Name (CN): Use a client-specific identifier (for example, kafka-admin or client)

Verify certificate files

After the certificate is generated, verify that the files were created successfully by running:

ls ~/ca/certs

You should see files with your client name, including:

<client-name>.key (private key)
<client-name>.crt (certificate)
<client-name>.p12 (PKCS12 keystore)

Create Kafka client properties file

Create a new properties file for your Kafka client (for example, kafka-tls-client.properties) based on the provided kafka-admin-ssl.properties example file. Update the file paths to reference your newly generated client certificate files.

Example configuration:

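security.protocol=SSL
# The generated keystore is a PKCS12 (.p12) file
ssl.keystore.type=PKCS12
ssl.keystore.location=/path/to/<client-name>.p12
ssl.keystore.password=your-keystore-password
# Omit ssl.key.password if you didn’t set a key password.
# Properties files don’t support trailing comments, so keep comments on their own lines.
ssl.key.password=your-key-password
ssl.keystore.alias=your-private-key-alias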

Step 6: Testing the Kafka client connection

To test the Kafka client connection, do the following.

Set environment variables

First, set the required environment variables for your Kafka installation and MSK cluster:

export KAFKA_HOME=/home/ec2-user/kafka
export BOOTSTRAP_SERVERS=<your-msk-bootstrap-servers>

Note: Replace <your-msk-bootstrap-servers> with your actual Amazon MSK cluster bootstrap server endpoints (for example, b-1.mycluster.abc123.kafka.us-east-1.amazonaws.com:9094,b-2.mycluster.abc123.kafka.us-east-1.amazonaws.com:9094)

Run the Kafka list topics command

Execute the following command to verify that your client can successfully connect to Amazon MSK using mutual TLS authentication:

$KAFKA_HOME/bin/kafka-topics.sh \
  --bootstrap-server $BOOTSTRAP_SERVERS \
  --list \
  --command-config kafka-tls-client.properties

What this test does:

Connects to your Amazon MSK cluster using the TLS configuration in your properties file
Authenticates using your client certificate
Lists all available Kafka topics

Expected result: If successful, you should see a list of topics in your Kafka cluster (or an empty list if no topics exist yet).

If the connection fails, check:

Your bootstrap server endpoints are correct
You imported the private key and certificate chain into your keystore
The paths in your properties file point to the correct keystore and truststore files
Your client certificate was properly imported
Your Amazon MSK cluster security settings allow TLS client authentication
Your Amazon MSK cluster references the correct AWS Private CA ARN
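The file path check in the list above can be scripted before you rerun the Kafka command. The following is a minimal sketch: the function name check_ssl_paths is hypothetical, and it only confirms that the keystore and truststore locations named in a client properties file are readable. It does not inspect certificate contents or passwords.

```shell
#!/usr/bin/env bash
# Hypothetical helper: confirm that the keystore and truststore locations in a
# Kafka client properties file point at readable files.
check_ssl_paths() {
  local props="$1"
  local status=0
  local key value
  # Split each line at the first "="; the rest of the line is the value.
  while IFS='=' read -r key value; do
    case "$key" in
      ssl.keystore.location|ssl.truststore.location)
        if [ -r "$value" ]; then
          echo "ok: $key -> $value"
        else
          echo "unreadable: $key -> $value"
          status=1
        fi
        ;;
    esac
  done < "$props"
  return "$status"
}

# Example, using the properties file name from the earlier step:
#   check_ssl_paths kafka-tls-client.properties
```

The function returns a nonzero status when any referenced file is missing or unreadable, so it can gate the connection test in a script.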

Troubleshooting

Enable debug mode to verify certificate handshake

To troubleshoot certificate issues and verify which certificates are involved in the TLS handshake, enable Java SSL debug mode:

export KAFKA_OPTS="-Djavax.net.debug=ssl:handshake:verbose"
$KAFKA_HOME/bin/kafka-topics.sh \
  --bootstrap-server $BOOTSTRAP_SERVERS \
  --list \
  --command-config kafka-tls-client.properties

What this debug mode shows:

The complete TLS handshake process
Which certificates are being presented by both client and server
The certificate chain validation steps
Which certificate from your truststore is being used for authentication

When this is helpful:

When you have multiple certificates in your truststore and need to identify which one is being used
When troubleshooting certificate chain validation issues
When verifying that the correct client certificate is being presented during authentication
When diagnosing certificate mismatch or trust issues

Reading the debug output:

Look for lines containing:

***Certificate chain – Shows the certificates being presented
Found trusted certificate – Indicates which certificate in your truststore matched
Cert path validation – Shows the certificate chain validation process

To disable debug mode after troubleshooting, simply unset the environment variable:

unset KAFKA_OPTS

Conclusion

This post presents a solution for migrating TLS clients from self-managed Apache Kafka to Amazon MSK while reusing existing third-party CA-signed certificates. The approach removes the need for certificate reissuance by instead creating an intermediate CA from the existing third-party CA, importing it into AWS Certificate Manager as a Private CA, and integrating it with Amazon MSK. This maintains the established chain of trust through the common certificate authority, enabling seamless migration without operational disruption while preserving the existing security architecture and mTLS implementation. To read more about the Amazon MSK security model, see Security in Amazon MSK.


A guide to capacity planning for Airflow worker pool in Amazon MWAA

Boyko Radulov — Fri, 01 May 2026 15:42:45 +0000

In our previous post, A guide to Airflow worker pool optimization in Amazon MWAA, we explored when adding workers to your Amazon Managed Workflows for Apache Airflow (Amazon MWAA) environment actually solves performance issues, and when it doesn’t. We walked through patterns like high CPU utilization and long queue times where scaling may be appropriate, and anti-patterns like misconfigured Airflow settings and memory leaks where adding workers only masks the real problem. The key takeaway was clear: optimize first, scale second, and always let data drive the decision.

But what happens after you’ve done the optimization work? Your DAGs are efficient, your configurations are tuned, and your environment is running well. Then the business comes knocking: new regulatory requirements, additional data pipelines, expanded reporting. The workload is about to grow, and this time, you genuinely need more capacity.

This is where capacity planning comes in. Knowing how many workers to provision, before the new workload hits production, is the difference between a smooth rollout and a 5 AM SLA breach. In this post, we walk through a practical capacity planning framework for Amazon MWAA worker pools. Using a real-world financial services scenario, we show how to assess your current capacity, project future needs, calculate the right number of base workers, and set up monitoring to keep your environment healthy as workloads evolve.

Scenario: A financial services company needs to plan capacity for a 25% increase in directed acyclic graphs (DAGs) to support new regulatory reporting requirements.

Current vs projected state

The following table compares the current and expected state after adding 25% more DAGs.

Metric	Current	Projected	Change
DAGs	20	25	+25%
Peak Tasks (5-7 AM)	80	104	+24 tasks
Environment Class	mw1.medium	mw1.medium	No change
Base Workers	8	11	+3 workers
Tasks per Worker	10 (mw1.medium default)	10	No change
Available Capacity	80 slots (8 × 10)	110 slots (11 × 10)	+30 slots
Peak Utilization	100% (80/80 slots)	95% (104/110 slots)	Improved
Critical SLA	7 AM market open	7 AM market open	No tolerance

Capacity planning goal: Reduce utilization from 100% to 95% to maintain service level agreement (SLA) compliance and handle unexpected spikes.

Understanding current capacity: The environment currently runs 8 base workers, providing 80 concurrent task slots (8 workers × 10 tasks per worker). During the 5-7 AM peak with 80 concurrent tasks, this represents 100% utilization, a risky level that leaves no headroom for unexpected spikes or volatility.

With the planned addition of 5 new regulatory reporting DAGs, peak concurrent tasks will grow to 104. To maintain healthy operations with adequate buffer, we need to increase to 11 base workers (110 slots), resulting in 95% peak utilization with 6 slots of breathing room.

Why 100% utilization is risky: Running at 100% task utilization means:

Zero buffer for unexpected spikes
Any additional task causes immediate queuing
No room for market volatility or data volume increases
High risk of SLA breaches during unpredictable events

Best practice: Maintain at least 5-15% headroom (85-95% utilization) for production workloads with critical SLAs.

Why this sizing:

Current: 80 tasks ÷ 80 slots = 100% utilization (at capacity – risky!)
Projected: 104 tasks ÷ 110 slots = 95% utilization (healthy with buffer)
Buffer: 6 slots (5% headroom) protects against unexpected volatility spikes
SLA protection: Adequate headroom prevents queuing during normal operations

Capacity analysis

Every team asks the same critical question: “How many workers do I need?” The process is to identify your peak concurrent tasks from Amazon CloudWatch metrics, divide by your environment’s tasks-per-worker capacity, and add a 5-15% safety buffer.

Step 1: Identifying peak concurrent tasks from Amazon CloudWatch

To determine your peak workload, you need to analyze RunningTasks and QueuedTasks CloudWatch metrics for your Amazon MWAA environment. Navigate to Amazon CloudWatch and query the following key metrics:

Primary metrics for capacity planning:

RunningTasks: Number of tasks currently executing across all workers. This shows your actual concurrent task load.
QueuedTasks: Number of tasks waiting for available worker slots. High values indicate insufficient capacity.
AvailableWorkers: Current number of active workers in your environment.

How to find peak concurrent tasks:

Open the Amazon CloudWatch Console.
Choose Metrics.
Choose the MWAA namespace.
Select your environment name.
Add the RunningTasks metric.
Set time range to last 7-30 days.
Change statistic to Maximum.
Identify the highest value during your peak hours (for example, 5-7 AM).

Example query:
Note: The following query is conceptual and does not directly translate to Amazon CloudWatch-specific language. For more information, see Query your CloudWatch metrics with CloudWatch Metrics Insights.

SELECT MAX(RunningTasks) AS PeakConcurrentTasks
FROM MWAA_Metrics
WHERE Environment = 'prod-airflow'
  AND timestamp BETWEEN '2024-10-01' AND '2024-10-31'
  AND HOUR(timestamp) BETWEEN 5 AND 6;  -- hours 5 and 6 cover the 5-7 AM window

In our scenario, this analysis revealed 80 concurrent tasks during the 5-7 AM window. With the planned 25% DAG increase, we project this will grow to 104 concurrent tasks.
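
The conceptual query above can be approximated offline once the RunningTasks datapoints have been exported. The following is a minimal Python sketch, not an Amazon CloudWatch API call: the function name, the sample values, and the assumption that each datapoint is a (timestamp, value) pair are all illustrative.

```python
from datetime import datetime


def peak_concurrent_tasks(datapoints, start_hour=5, end_hour=7):
    """Return the highest RunningTasks value seen in [start_hour, end_hour)."""
    in_window = [
        value
        for timestamp, value in datapoints
        if start_hour <= timestamp.hour < end_hour
    ]
    if not in_window:
        raise ValueError("no datapoints fall inside the peak window")
    return max(in_window)


# Hypothetical samples standing in for exported RunningTasks maximums.
samples = [
    (datetime(2024, 10, 1, 4, 55), 35),
    (datetime(2024, 10, 1, 5, 30), 72),
    (datetime(2024, 10, 1, 6, 45), 80),
    (datetime(2024, 10, 1, 7, 5), 41),
]

peak = peak_concurrent_tasks(samples)
print(peak)  # 80
```

The window is half-open, so a datapoint at 7:05 AM is excluded from the 5-7 AM peak.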

Step 2: Calculate required workers

To calculate the number of workers required to avoid queuing, use the following formula and round the result up to the next whole worker: Peak concurrent tasks ÷ Tasks per worker × Safety buffer = Required workers

In the projected scenario with 104 tasks at peak hours, an mw1.medium environment with the default concurrency configuration, and a 5% safety buffer, we need 11 workers.

104 peak tasks ÷ 10 tasks per worker × 1.05 buffer = 10.92, which rounds up to 11 workers required to handle the workload without queuing during the busiest periods.
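
The worker calculation can be sketched in a few lines of Python. This is an illustration of the formula under the scenario's numbers, not an Amazon MWAA API; the function name is hypothetical, and the result is rounded up because a fractional worker cannot be provisioned.

```python
def required_workers(peak_tasks, tasks_per_worker, buffer_pct):
    """Workers needed to run peak_tasks concurrently, including a safety buffer.

    Integer arithmetic is used so the round-up step is exact.
    """
    numerator = peak_tasks * (100 + buffer_pct)
    denominator = tasks_per_worker * 100
    return -(-numerator // denominator)  # ceiling division


workers = required_workers(peak_tasks=104, tasks_per_worker=10, buffer_pct=5)
slots = workers * 10
utilization = 104 / slots * 100

print(workers)                 # 11
print(slots - 104)             # 6 slots of headroom
print(f"{utilization:.1f}%")   # 94.5%
```

The 94.5% result is the figure that the post rounds to 95% peak utilization.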

Capacity monitoring and triggers

There are a few important Amazon CloudWatch metrics to monitor for environment health.

Key metrics to monitor

Monitor these five critical Amazon CloudWatch metrics to detect capacity issues:

QueuedTasks (>10 for >5 minutes indicates insufficient capacity)
RunningTasks (consistently at maximum suggests the need for more workers)
AdditionalWorkers (active for more than 6 hours daily signals the permanent worker problem, where automatic scaling workers have effectively become part of the baseline)
Worker CPU (>85% sustained requires environment class upgrade or workload optimization)
Task Duration (+15% increase means reduced effective capacity per worker)

These metrics provide early warning signals to adjust capacity before SLA breaches occur.

Metric	Threshold	Action
QueuedTasks	>10 for >5 minutes	Investigate capacity
RunningTasks	Consistently at max	Increase base workers
AdditionalWorkers	Active >6 hours daily	Increase base workers
Worker CPU	>85% sustained	Upgrade environment class
Task Duration	+15% increase	Review capacity per worker

Amazon CloudWatch monitoring queries

Note: The following queries are conceptual and do not directly translate to Amazon CloudWatch-specific language. For more information, see Query your CloudWatch metrics with CloudWatch Metrics Insights.

Queue depth during peak hours

SELECT AVG(QueuedTasks)
FROM MWAA_Metrics
WHERE Environment = 'prod-airflow'
  AND timestamp BETWEEN '05:00' AND '07:00'
GROUP BY 5m;

Worker utilization efficiency

SELECT AVG(RunningTasks) / AVG(AvailableWorkers * 10) * 100 AS UtilizationPercent  -- 10 = tasks per worker on mw1.medium
FROM MWAA_Metrics
WHERE Environment = 'prod-airflow';

Detect permanent worker problem

SELECT DATE(timestamp) AS date,
       AVG(AdditionalWorkers) AS avg_additional,
       MAX(AdditionalWorkers) AS max_additional
FROM MWAA_Metrics
WHERE AdditionalWorkers > 0
GROUP BY DATE(timestamp)
HAVING AVG(AdditionalWorkers) > 5;
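
The 6-hours-per-day threshold from the monitoring table can also be checked directly against exported datapoints. The following Python sketch assumes one AdditionalWorkers sample per hour; the function name and sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime


def days_with_permanent_workers(hourly_samples, max_hours=6):
    """Return the dates on which AdditionalWorkers was above zero
    in more than max_hours hourly samples."""
    active_hours = Counter(
        timestamp.date()
        for timestamp, additional in hourly_samples
        if additional > 0
    )
    return sorted(day for day, hours in active_hours.items() if hours > max_hours)


# Hypothetical data: 8 active hours on October 1, 3 active hours on October 2.
samples = (
    [(datetime(2024, 10, 1, hour), 2) for hour in range(4, 12)]
    + [(datetime(2024, 10, 2, hour), 1) for hour in range(5, 8)]
    + [(datetime(2024, 10, 2, hour), 0) for hour in range(8, 12)]
)

flagged = days_with_permanent_workers(samples)
print(flagged)  # [datetime.date(2024, 10, 1)]
```

A day that appears in the result is a candidate for raising the base worker count rather than paying for automatic scaling workers that rarely shut down.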

Setting up alerts

You can configure these alarms to identify problems as soon as they are introduced.

Recommended Amazon CloudWatch alarms:

High queue depth alert
- Metric: QueuedTasks
- Threshold: > 10 for 2 consecutive 5-minute periods
- Action: Notify operations team
Permanent worker detection
- Metric: AdditionalWorkers
- Threshold: > 0 for 6+ hours
- Action: Review capacity planning
SLA risk alert
- Metric: QueuedTasks during 5-7 AM window
- Threshold: > 5 tasks
- Action: Page on-call engineer
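
As a rough sketch, the high queue depth alert could be expressed as parameters for the CloudWatch PutMetricAlarm operation. The helper name, alarm name, and the Environment dimension below are assumptions for illustration; verify the namespace and dimensions that your environment actually publishes before using them.

```python
def queued_tasks_alarm(environment, threshold=10, periods=2, sns_topic_arn=None):
    """Build keyword arguments for a high queue depth alarm."""
    return {
        "AlarmName": f"{environment}-high-queue-depth",  # hypothetical name
        "Namespace": "AWS/MWAA",  # assumption: confirm in your account
        "MetricName": "QueuedTasks",
        # Assumption: the dimension set may differ in your environment.
        "Dimensions": [{"Name": "Environment", "Value": environment}],
        "Statistic": "Maximum",
        "Period": 300,  # one 5-minute period, in seconds
        "EvaluationPeriods": periods,  # 2 consecutive periods
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn] if sns_topic_arn else [],
    }


params = queued_tasks_alarm("prod-airflow")
print(params["AlarmName"])  # prod-airflow-high-queue-depth

# With boto3 installed and credentials configured, the alarm could be created with:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Keeping the parameters in a plain dictionary makes it possible to review or unit test the thresholds before any alarm is created.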

When to revisit capacity planning

Conduct quarterly scheduled reviews to analyze trends and project growth. Also run immediate trigger-based assessments when:

DAG count increases >10% (or more than your safety buffer)
Performance degrades
Cost anomalies appear (indicating permanent workers)
Any SLA breach occurs.

This dual approach provides proactive capacity management while enabling rapid response to emerging issues.

Trigger	Frequency	Action
Scheduled Review	Quarterly	Analyze trends, project growth
DAG Growth	>10% increase	Recalculate capacity needs
Performance Degradation	As observed	Immediate capacity assessment
Cost Anomalies	Monthly	Check for permanent workers
SLA Breaches	Any occurrence	Emergency capacity review

Decision matrix

The framework presents three capacity planning approaches, each optimized for different organizational priorities.

The Full Base Worker Provisioning strategy (the conservative path) sets base workers equal to the calculated requirement, eliminating queue times during peak periods and guaranteeing SLA compliance with predictable fixed costs, while automatic scaling handles only unexpected spikes—ideal for mission-critical workloads with strict SLA requirements.

The Minimal Base + Automatic Scaling approach (the cost-focused path) maintains minimal base workers at current levels and relies heavily on automatic scaling, accepting 3-5 minute delays during peak periods and SLA breach risks in exchange for lower baseline costs, though this requires intensive monitoring and carries explicit warnings about high SLA risk.

The Hybrid Approach (the balanced path) provisions base workers at 80% of the calculated requirement with automatic scaling covering the remaining 20%, resulting in 2-3 minute delays during spikes while balancing cost against performance—suitable for moderate SLA requirements with some budget constraints.

The following table compares the three approaches so that teams can make informed provisioning decisions aligned with their operational requirements and financial constraints.

Approach	Queue time	SLA compliance	Ideal use case
Full Base Worker Provisioning	Under 30 seconds	Guaranteed	Mission-critical, predictable workloads
Hybrid Approach	2-3 minutes	High probability	Moderate SLA requirements with budget constraints
Minimal Base + Automatic Scaling	3-5 minutes	At risk during peak	Development environments with flexible SLA tolerance
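
To make the three strategies concrete, the following Python sketch applies them to the scenario from this post (11 calculated workers, 8 current base workers). The function and key names are hypothetical, and rounding the hybrid value up is an assumption, since the post specifies only "80% of the calculated requirement".

```python
import math


def base_workers_by_strategy(required, current):
    """Base worker count under each provisioning strategy."""
    return {
        "full_base": required,                        # conservative path
        "hybrid": math.ceil(required * 80 / 100),     # balanced path, rounded up
        "minimal_base": current,                      # cost-focused path
    }


plan = base_workers_by_strategy(required=11, current=8)
print(plan)  # {'full_base': 11, 'hybrid': 9, 'minimal_base': 8}

# Workers that automatic scaling must add at peak under each strategy.
scaling_gap = {name: 11 - base for name, base in plan.items()}
print(scaling_gap)  # {'full_base': 0, 'hybrid': 2, 'minimal_base': 3}
```

The larger the gap, the more of the peak depends on automatic scaling provisioning time, which is where the longer queue delays come from.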

Key takeaway

Effective capacity planning prevents both under-provisioning (SLA breaches) and over-provisioning (cost overruns).

Capacity planning principles

Calculate capacity needs BEFORE adding workload – Use peak task projections with 5-15% safety buffer
Size minimum workers for peak demand – Don’t rely on automatic scaling for predictable loads
Use automatic scaling only for unexpected spikes – Treat as safety net, not primary capacity
Target 85-95% utilization during peak hours – Ensures headroom for unexpected growth
Plan 5-15% headroom for unexpected growth – Production often differs from testing
Monitor AdditionalWorkers metric – If active >6 hours daily, increase base workers
Review quarterly + trigger-based assessments – Regular reviews plus immediate action on issues
Balance cost and performance based on SLA criticality – Business impact justifies infrastructure investment

Success metrics

Queue efficiency: Average queue time <30 seconds during peak
SLA compliance: >99.5% of critical tasks complete on time
Resource utilization: 85-95% during peak hours (optimal efficiency)
Cost predictability: <10% variance in monthly worker costs

Conclusion

Capacity planning is not a one-time exercise. It’s an ongoing discipline. The framework we’ve outlined gives you a repeatable process: measure your current peak utilization through CloudWatch metrics, project growth based on incoming workloads, calculate the required workers with an appropriate safety buffer, and monitor continuously to catch drift before it becomes an outage.

The financial services scenario in this post illustrates a common reality: running at 100% utilization during peak hours leaves zero room for the unexpected. By sizing to 95% peak utilization with a modest buffer, the team gained the headroom needed to absorb volatility without risking their 7 AM market-open SLA.

Whether you choose full base worker provisioning for mission-critical pipelines, a hybrid approach for moderate SLA requirements, or lean on automatic scaling for development workloads, the right strategy depends on your business context, not a one-size-fits-all rule. Pair your capacity plan with the CloudWatch alarms and review triggers we covered, and you’ll catch capacity gaps early.

Combined with the optimization-first approach from Part 1, you now have a complete toolkit: diagnose before you scale, optimize before you provision, and plan before you deploy. Your MWAA environment and your on-call engineers will thank you.

To get started, visit the Amazon MWAA product page and the Amazon MWAA console page.

If you have questions or want to share your MWAA capacity planning, leave a comment.

A guide to Airflow worker pool optimization in Amazon MWAA

Boyko Radulov — Fri, 01 May 2026 15:41:26 +0000

Optimizing the Airflow worker pool configuration in Amazon Managed Workflows for Apache Airflow (Amazon MWAA), the AWS fully managed Apache Airflow service, is an important yet often overlooked strategy for scaling workflow operations. Tasks queued for longer periods can create the illusion that additional workers are the solution, when in reality the root cause might lie elsewhere. The decision to scale isn’t always straightforward. DevOps engineers and system administrators frequently face the challenge of determining whether adding more workers will solve their performance issues or only increase operational cost without addressing the root cause.

This post explores different patterns for worker scaling decisions in Amazon MWAA, focusing on the task pool mechanism and its relationship to worker allocation. By examining specific scenarios and providing a practical decision framework, this post helps you determine whether adding workers is the right solution for your performance challenges, and if so, how to implement this scaling effectively.

Main patterns

This section discusses the most frequently seen problems that raise the question of whether adding workers would improve the health of your environment.

High CPU

Airflow serves as a workflow management platform that coordinates and schedules tasks to be run on external processing services. It acts as a central orchestrator that can trigger and monitor tasks across various data processing systems like AWS Glue, AWS Batch, Amazon EMR, and other specialized data processing tools. Rather than processing data itself, Airflow’s strength lies in managing complex workflows and coordinating jobs between different systems and services.

In Analytics and Big Data environments, there is a prevalent misconception that saturated resources automatically warrant adding more capacity. However, for Amazon MWAA, understanding your workflow characteristics and optimization opportunities should precede scaling decisions.

As you scale up your workflows, resource utilization of the Airflow clusters naturally increases. When workers consistently operate at full capacity, it may seem intuitive to add additional compute resources. However, this approach often masks underlying inefficiencies rather than resolving them.

For example, if a single task consumes 100% of the available CPU on an Amazon MWAA worker, adding workers will not resolve the problem, because the task itself has not been optimized or split into smaller parts. Increasing the minimum worker count in this case will not improve performance and will only increase operating costs.

When your Amazon MWAA workers are consistently running above 90% CPU or memory utilization, you’ve reached a critical decision point. Before taking action, it is essential to understand the root cause. You have three primary options:

Scale horizontally by adding additional workers to distribute the load.
Scale vertically by upgrading to a larger environment class for more resources per worker.
Optimize your DAGs and scheduling patterns to be more efficient and consume fewer resources.

Each approach addresses different underlying issues, and choosing the right path depends on identifying whether you are facing a capacity constraint, resource-intensive task design, or workflow inefficiency. For guidance on optimization strategies, please refer to Performance tuning for Apache Airflow on Amazon MWAA.

To monitor CPUUtilization and MemoryUtilization on the workers, follow Accessing metrics in the Amazon CloudWatch console, choose the corresponding metrics, and then apply the following settings:

Select a time window long enough to show usage patterns.
Set period to 1 Minute.
Set statistics to Maximum.

Long queue time

Sometimes Airflow tasks are stuck in a queued state for a long time, which prevents DAGs from completing on time.

In Amazon MWAA, each environment class comes with configured minimum and maximum worker nodes. Each worker provides a pre-configured concurrency, which is the number of tasks that can run simultaneously on each worker at any given time. The behavior is controlled through celery.worker_autoscale=(max,min).

For example, if you have minimum 4 mw1.small workers, with default Airflow configuration, you will be able to run 20 concurrent tasks (4 workers x 5 max_tasks_per_worker). If your system suddenly requires more than 20 tasks to execute concurrently, this will result in an autoscaling event. Amazon MWAA will decide how to scale your workers efficiently, and trigger the process. The autoscaling process, however, requires additional time to provision new workers resulting in additional tasks in queued status. To mitigate this queuing issue, consider the following:

If the CPU utilization on the workers is low, increasing the max value in celery.worker_autoscale=(max,min) can reduce the time tasks stay in the queued state, because each worker can process more tasks concurrently. Note that an Airflow worker accepts tasks up to the defined task concurrency regardless of its available system resources. As a result, a base worker may reach 100% CPU or memory utilization before automatic scaling takes effect.
If you do not want to increase the task concurrency on the workers, increasing the minimum worker count can also be beneficial because having more available workers allows a higher number of tasks to run concurrently.
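
The two mitigation options can be compared with a small Python sketch. It assumes the celery.worker_autoscale value is written as a "max,min" string; the function names and the demand figure are hypothetical.

```python
def concurrent_task_slots(workers, worker_autoscale):
    """Task slots available for a worker count and a "max,min" autoscale value."""
    max_tasks, _min_tasks = (int(part) for part in worker_autoscale.split(","))
    return workers * max_tasks


def needs_autoscaling(demand, min_workers, worker_autoscale):
    """True when demand exceeds what the minimum workers can run at once."""
    return demand > concurrent_task_slots(min_workers, worker_autoscale)


print(concurrent_task_slots(4, "5,5"))    # 20
print(needs_autoscaling(25, 4, "5,5"))    # True: 25 tasks exceed 20 slots
print(needs_autoscaling(25, 4, "10,10"))  # False: higher per-worker concurrency
print(needs_autoscaling(25, 5, "5,5"))    # False: one more minimum worker
```

Either change removes the scaling event for this demand level. Raising per-worker concurrency is only safe when CPU and memory utilization on the workers are low.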

Scheduling delays

Adding new DAGs can not only affect your system resources, but it can also create uneven scheduling patterns. Some DAGs may experience delayed execution because of resource competition, even when the overall environment metrics appear healthy. This scheduling skew often manifests as inconsistent task pickup times, where certain workflows consistently wait longer in the queue while others execute promptly.

When Amazon CloudWatch metrics show increasing variance in task scheduling times, particularly during periods of high DAG activity, it signals the need for environment optimization. This scenario requires careful analysis of execution patterns and resource utilization to determine which of the following applies:

Adding workers would distribute the workload. This solution is most effective when the high utilization comes primarily from task execution load rather than DAG parsing or scheduling overhead. Adding more minimum workers allows you to execute more tasks in parallel. For example, if the value of AWS/MWAA/ApproximateAgeOfOldestTask is steadily increasing, the workers are not consuming messages from the queue fast enough. You can also monitor AWS/MWAA/QueuedTasks to identify similar patterns.
Upgrading the environment class would provide better scheduling capacity. If the Scheduler is showing signs of strain or if you’re seeing high resource utilization across all components, upgrading to a larger environment class might be the most appropriate solution. This provides more resources to both the Scheduler and Workers, allowing for better handling of increased DAG complexity and volume. To validate this, use AWS/MWAA/CPUUtilization and AWS/MWAA/MemoryUtilization in the Cluster metrics and choose the Scheduler, BaseWorker, and AdditionalWorker metrics.
Restructuring DAG schedules would reduce resource contention.

The key is to understand your workflow patterns and identify whether the scheduling delays are because of insufficient worker capacity or other environmental constraints.

Anti-patterns

This section showcases the most common anti-patterns that lead Amazon MWAA users to believe that adding more workers will improve performance.

Underutilized workers

When evaluating Amazon MWAA performance bottlenecks, it’s important to distinguish between resource constraints and DAG design inefficiencies before scaling the environment.

Sometimes the Amazon MWAA environment has the capacity to run 100 tasks concurrently but your queue metrics (AWS/MWAA/RunningTasks) show only 20 tasks active most of the time with no tasks remaining in queued state. In such scenarios, you are advised to check Amazon CloudWatch for consistently low CPU and memory usage on existing workers during peak workload times. If this is confirmed, it is usually an indication of inefficiencies in DAG design, scheduling patterns, or Airflow configuration.

You have two primary options to address this:

1. Downsize: If you do not expect your workload to increase, it is safe to assume you have over-provisioned your cluster. Start by removing extra workers, and downsize the environment class only as a last step.

2. Optimize: Fine-tune your DAG scheduling and Airflow configuration, using Pools and the Airflow concurrency settings, to increase the throughput of your system.

Misconfigured Airflow configurations that create artificial bottlenecks

In Apache Airflow, performance bottlenecks often occur because of configuration settings, not actual resource constraints. At such times, DAG executions get delayed not because of insufficient compute, but because of incorrect concurrency configuration.

Efficient use of Amazon MWAA requires reviewing not only resource utilization for Workers and Schedulers but also concurrency configurations for artificially created bottlenecks. Sometimes one restrictive configuration prevents the scaling benefits of a larger environment or additional workers. Always audit Airflow configurations if performance seems limited even when system metrics suggest spare capacity.

Important consideration: Amazon Managed Workflows for Apache Airflow (Amazon MWAA) does not automatically update the worker concurrency configuration when you change the environment class. This behavior is important to understand when scaling your environment. Suppose you initially create an mw1.small environment, where each worker can handle up to 5 concurrent tasks by default. When you upgrade to the mw1.medium environment class (which supports 10 concurrent tasks per worker by default), the concurrency setting remains at 5 for environments updated in place. You must manually update the concurrency configuration to take full advantage of the increased capacity available in the medium environment class.

Because of this you need to also update the Airflow configurations that control concurrency whenever you update the environment class. To update the concurrency setting after upgrading your environment class, modify the celery.worker_autoscale configuration in your Apache Airflow configuration options. This makes sure your workers can process the maximum number of concurrent tasks supported by your new environment class.
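
As an illustration, the configuration option to apply after an environment class change could be derived as follows. This Python sketch covers only the two defaults stated in this post (5 for mw1.small, 10 for mw1.medium); the helper name is hypothetical, and the "max,min" value format is an assumption to confirm against your environment's current setting.

```python
# Default tasks per worker as stated in this post; other classes are left out
# deliberately rather than guessed.
DEFAULT_TASKS_PER_WORKER = {"mw1.small": 5, "mw1.medium": 10}


def concurrency_options(environment_class):
    """Airflow configuration options matching the class default concurrency."""
    tasks = DEFAULT_TASKS_PER_WORKER[environment_class]
    # Assumption: the option value uses the "max,min" string format.
    return {"celery.worker_autoscale": f"{tasks},{tasks}"}


options = concurrency_options("mw1.medium")
print(options)  # {'celery.worker_autoscale': '10,10'}

# With boto3 installed and credentials configured, the options could be applied with:
#   boto3.client("mwaa").update_environment(
#       Name="prod-airflow", AirflowConfigurationOptions=options)
```

Check worker CPU and memory utilization after the change, because a higher concurrency value lets each worker accept more tasks regardless of its available resources.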

Other times, an Amazon MWAA environment can be constrained by max_active_runs or DAG concurrency controls instead of actual resource limits. These configuration-based throttles prevent tasks from running, even when the worker instances have available compute to handle the workload.

There is an important distinction between the two. Configuration limits act as artificial caps on parallelism, while true resource limits indicate that workers are fully utilizing their CPU or memory capacity. Understanding which type of constraint affects your environment helps you determine whether to adjust configuration settings or scale your infrastructure.

Adjusting Airflow configurations such as Pools, concurrency, and max_active_runs can often solve performance problems without scaling workers. The following are some of the configurations you can use to control this behavior:

max_active_runs (DAG level; the environment-level default is max_active_runs_per_dag): Controls how many runs of a given DAG are allowed at the same time. If set to 2, only 2 DAG runs can run concurrently, even if there is plenty of worker capacity left. Extra runs queue, making DAG executions slow even though workers are idle.
max_active_tasks: Limits the number of tasks from a DAG that can run at any moment, regardless of overall system capacity or number of workers. It is set in the DAG definition (this field was previously named concurrency) or at the environment level.
Pools: Restrict how many tasks of a certain type (often resource heavy) can run at once. A pool with only 3 slots will throttle any tasks above 3 assigned to that pool, leaving workers idle.
Execution timeouts and retries: If these are not tuned, failed tasks might fill up slots unnecessarily, and stuck tasks can block worker slots and slow queue processing.
Scheduling intervals and dependencies: Overlapping or inefficient scheduling may cause idle periods or excess contention for resources, affecting real throughput.

How Airflow configurations can override each other

Airflow has multiple layers of concurrency and scheduling controls: some at the environment level, some at the DAG or task level, and others for pools. More restrictive settings override more permissive ones, which can result in unexpected queue buildup.

DAG level vs Environment level: If “max_active_runs” (DAG level) is lower than the environment-level “max_active_runs_per_dag” or system-wide concurrency, the DAG setting is used, throttling tasks even if the environment could do more.

Task level overrides: Individual task definitions can have their own parameters like “max_active_tis_per_dag” which can cap runs per task and create a bottleneck if set lower than global settings.

Order of precedence: The most restrictive relevant configuration at any level (Environment, DAG, Task) effectively sets the upper bound for parallel task execution.

Setting Location	Setting	Effect on task throughput
Environment Level	parallelism	Max total tasks running on Scheduler
DAG Level	max_active_runs	Max simultaneous DAG runs
DAG Level	max_active_tasks (formerly concurrency)	Max concurrent tasks for that DAG
Task Level	max_active_tis_per_dag	Max concurrent instances of that task

Performance issues often resemble resource exhaustion, but actually derive from overly restrictive configurations. Audit all the preceding parameters carefully. You can loosen restrictive values step by step and monitor their effect before deciding to scale your cluster further. This approach ensures optimal and cost-efficient usage of your cloud resources without paying for idle capacity.
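
The order-of-precedence rule can be sketched as taking the minimum across every applicable limit. The following Python example is illustrative only: the function name and all of the limit values are hypothetical, and real environments may have further limits that are not modeled here.

```python
def effective_task_limit(limits):
    """Return the (name, value) of the most restrictive limit.

    Limits set to None are treated as not configured and ignored.
    """
    active = {name: value for name, value in limits.items() if value is not None}
    name = min(active, key=active.get)
    return name, active[name]


# Hypothetical values for a single DAG whose tasks all use one pool.
limits = {
    "worker_slots": 11 * 10,   # 11 workers x 10 tasks per worker
    "parallelism": 150,
    "max_active_tasks": 16,
    "pool_slots": 3,
}

print(effective_task_limit(limits))  # ('pool_slots', 3)
```

In this example, 110 worker slots are available but only 3 tasks run at once, so adding workers would not change throughput until the pool size is raised.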

Slow resource depletion from memory leaks

A common scenario for memory leak or slow resource depletion in Amazon MWAA is when DAGs and tasks begin to fail or slow down over time. Scaling workers or increasing environment size does not resolve the underlying issue. This happens because the root cause is not a lack of capacity but rather an application-level leak that causes persistent exhaustion.

For example, as Airflow continuously runs tasks and parses DAGs over time, memory consumption can steadily increase across the environment. This might manifest as an Amazon MWAA metadata database experiencing declining FreeableMemory metrics despite consistent or even reduced workloads. When this occurs, database query performance gradually declines as memory resources become constrained for the scheduler, workers, and metadata database, ultimately affecting overall environment responsiveness because Airflow depends heavily on its metadata database for critical operations. This scenario is similar to how an application might create database connections without properly closing them, leading to resource exhaustion over time.

Graph: Declining FreeableMemory and MemoryUtilization

Common causes:

Connection pool exhaustion: DAGs that fail to properly close database connections can lead to connection pool exhaustion and memory leaks in the database.
Resource-intensive operations: Complex, long-running queries or XCom operations against the metadata database can consume excessive memory.
Inefficient DAG design: DAGs with numerous top-level Python calls can trigger database queries during DAG parsing. For instance, using Variable.get() calls at the DAG level rather than at the task level creates unnecessary database load.

Recommended solutions:

Implement Amazon CloudWatch monitoring: Establish Amazon CloudWatch alarms for FreeableMemory with appropriate thresholds to detect issues early.
Regular database maintenance: Perform scheduled database clean-up operations to purge historical data that is no longer needed.
Optimize DAG code: Refactor DAGs to move database operations like Variable.get() from the DAG level to the task level to reduce parsing overhead.
Connection management: Make sure all database connections are properly closed after use to prevent connection pool exhaustion.

By following the preceding recommendations you can maintain healthy memory utilization for the metadata database and maintain optimal performance of your Amazon MWAA environment without needing to scale workers.
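
The effect of moving a variable lookup from the DAG level to the task level can be illustrated without Airflow installed. In the following Python sketch, FakeVariable is a hypothetical stand-in for Airflow's Variable class that only counts lookups, and the parse and run counts are arbitrary.

```python
class FakeVariable:
    """Stand-in for Airflow's Variable; counts metadata database lookups."""

    lookups = 0

    @classmethod
    def get(cls, key):
        cls.lookups += 1
        return f"value-of-{key}"


def parse_dag_top_level():
    # Anti-pattern: the lookup runs every time the DAG file is parsed.
    bucket = FakeVariable.get("bucket")

    def task():
        return bucket

    return task


def parse_dag_task_level():
    # Preferred: the lookup is deferred until the task actually runs.
    def task():
        return FakeVariable.get("bucket")

    return task


def lookups_after(parse, parses, runs):
    """Count lookups caused by parsing the DAG file and running its task."""
    FakeVariable.lookups = 0
    task = None
    for _ in range(parses):
        task = parse()
    for _ in range(runs):
        task()
    return FakeVariable.lookups


print(lookups_after(parse_dag_top_level, parses=100, runs=1))   # 100
print(lookups_after(parse_dag_task_level, parses=100, runs=1))  # 1
```

Because DAG files are parsed far more often than tasks run, the top-level version generates database load that scales with parsing frequency rather than with actual work.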

Conclusion

The decision to add workers in Amazon MWAA environments requires careful consideration of multiple factors beyond simple task queue metrics. In this post, we showed that while adding workers can address certain performance challenges, it’s often not the optimal first response to system bottlenecks.

Key considerations before scaling workers include:

Root cause analysis
- Verify whether high CPU/memory usage stems from task optimization issues.
- Examine if queuing problems result from configuration constraints rather than resource limitations.
- Investigate potential memory leaks or resource depletion patterns.
Configuration optimization
- Review and adjust Airflow parameters (concurrency settings, pools, timeouts).
- Understand the interaction between different configuration layers.
- Optimize DAG design and scheduling patterns.

The most successful Amazon MWAA implementations follow a systematic approach: first optimizing existing resources and configurations, then scaling workers only when justified by data-driven capacity planning. This approach ensures cost-effective operations while maintaining reliable workflow performance.

Remember that worker scaling is only one tool in the Amazon MWAA optimization toolkit. Long-term success depends on building a comprehensive performance management strategy that combines proper monitoring, proactive capacity planning, and continuous optimization of your Airflow workflows.

In the next post, we discuss capacity planning and the steps to perform before adding DAGs to your environment, so that you can plan for the additional load and make sure you have enough headroom.

To get started, visit the Amazon MWAA product page and the Performance tuning for Apache Airflow on Amazon MWAA page.

If you have questions or want to share your MWAA scaling experiences, leave a comment below.

Unified observability in Amazon OpenSearch Service: metrics, traces, and AI agent debugging in a single interface

Muthu Pitchaimani — Tue, 28 Apr 2026 17:29:01 +0000

Amazon OpenSearch Service now brings application monitoring, native Amazon Managed Service for Prometheus integration, and AI agent tracing together in OpenSearch UI’s observability workspace. You can query Prometheus metrics with PromQL alongside logs and traces stored in Amazon OpenSearch Service, trace an AI agent’s full reasoning chain down to the failing tool call, and drill from a service-level health view to the exact span that caused a checkout failure, all without leaving the interface.

In this post, we walk through two real-world scenarios using the OpenTelemetry sample app: a multi-agent travel planner facing slow processing, and a checkout flow quietly failing on one microservice. We chase each one to its root cause using these new capabilities.

Scenario 1: An underperforming AI agent

Your multi-agent travel planner is live and users start reporting slow responses. With the new AI agent tracing capability in Amazon OpenSearch Service, you can trace the agent’s full processing path to pinpoint exactly where things went wrong.

In any observability workspace in OpenSearch UI, navigate to Application Map in the left navigation pane.

You can see the full topology of your system including the travel agent and the sub-agents it calls. The travel agent node shows elevated latency and occasional errors. Select it, and the side panel confirms that latency is up but the latency chart shows intermittent spikes rather than consistent degradation.

The application map tells you something is wrong, but understanding why an AI agent is underperforming requires seeing its reasoning chain. Select Agent Traces in the left navigation pane, then filter by service name and time range.

Select one of the traces to see the trace tree. Unlike a traditional span waterfall, this view is organized around the agent’s reasoning chain: the root agent span, the LLM calls it made, the tools it invoked, and how they nested, with each step color-coded by type. The trace map provides a visual directed graph of the same execution. You can see which model was called, how many input and output tokens were consumed, and the actual messages sent to and received from the model.

A tool call inside the weather agent errored out. The agent then spent additional time reasoning about the failure before returning a partial response, which explains both the intermittent latency spikes and the occasional faults.
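To make the idea of walking a trace tree concrete, the following is a minimal sketch in plain Python, not the logic OpenSearch UI uses. It links flat spans into a tree through their parent IDs and returns the path to every errored tool span. The Span class, the span names, and the durations are all hypothetical and only mirror the scenario above.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    span_id: str
    parent_id: Optional[str]
    name: str
    operation: str  # stands in for the gen_ai.operation.name attribute
    duration_ms: float
    error: bool = False
    children: list["Span"] = field(default_factory=list)


def build_tree(spans: list[Span]) -> Span:
    """Link each span to its parent and return the root span."""
    by_id = {s.span_id: s for s in spans}
    root = None
    for s in spans:
        if s.parent_id is None:
            root = s
        else:
            by_id[s.parent_id].children.append(s)
    if root is None:
        raise ValueError("trace has no root span")
    return root


def find_failed_tools(span: Span, path: tuple[str, ...] = ()) -> list[tuple[str, ...]]:
    """Return the root-to-span path of every errored tool span."""
    here = path + (span.name,)
    found = [here] if span.error and span.operation == "execute_tool" else []
    for child in span.children:
        found.extend(find_failed_tools(child, here))
    return found


# Hypothetical spans shaped like the travel planner scenario.
spans = [
    Span("1", None, "travel-agent", "invoke_agent", 9200.0),
    Span("2", "1", "weather-agent", "invoke_agent", 6100.0),
    Span("3", "2", "get_forecast", "execute_tool", 3000.0, error=True),
    Span("4", "2", "chat", "chat", 2900.0),
    Span("5", "1", "flight-agent", "invoke_agent", 1800.0),
]

root = build_tree(spans)
failed_paths = find_failed_tools(root)
for p in failed_paths:
    print(" > ".join(p))
```

The printed path points from the root agent through the sub-agent to the failing tool, which is the same drill-down the trace tree view gives you interactively.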

Why this matters for AI agents

Agents make autonomous decisions based on LLM responses, tool results, and chained reasoning. Unlike traditional microservices with deterministic code paths, agent behavior varies across executions. Without semantic tracing that captures these AI-specific signals, root-cause analysis is guesswork. The trace tree surfaced the model name, token counts, and failing tool call because the travel planner was instrumented with OpenTelemetry’s generative AI semantic conventions. The next section describes how.

Instrumenting AI agents

OpenTelemetry auto-instrumentation enriches spans with well-known attributes for HTTP, database, and gRPC calls. AI agents need a different set of attributes that standard instrumentation doesn’t cover, such as which LLM was called, how many tokens were consumed, and which tools were invoked.

The OpenTelemetry gen_ai semantic conventions define standard attributes for these signals, including gen_ai.operation.name, gen_ai.usage.input_tokens, gen_ai.request.model, and gen_ai.tool.name. When Amazon OpenSearch Service receives spans with these attributes, it categorizes them by operation type (agent, LLM, tool, embeddings, retrieval) and renders the agent trace tree and trace map views.
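As a rough illustration of categorizing spans by operation type, the sketch below maps a span’s gen_ai.operation.name value to a display category. The attribute keys and operation names follow the OpenTelemetry gen_ai conventions, but the mapping table and the categorize function are assumptions made for this example and do not describe how Amazon OpenSearch Service implements the grouping. The retrieval category is left out because the sketch does not assume an operation name for it.

```python
# Hypothetical mapping from gen_ai.operation.name values to display categories.
CATEGORY_BY_OPERATION = {
    "invoke_agent": "agent",
    "create_agent": "agent",
    "chat": "llm",
    "text_completion": "llm",
    "execute_tool": "tool",
    "embeddings": "embeddings",
}


def categorize(attributes: dict) -> str:
    """Return the display category for a span's attributes, or 'other'."""
    operation = attributes.get("gen_ai.operation.name")
    return CATEGORY_BY_OPERATION.get(operation, "other")


# Example attribute sets; the values are made up for illustration.
llm_span = {
    "gen_ai.operation.name": "chat",
    "gen_ai.request.model": "gpt-4o",
    "gen_ai.usage.input_tokens": 412,
    "gen_ai.usage.output_tokens": 96,
}
tool_span = {
    "gen_ai.operation.name": "execute_tool",
    "gen_ai.tool.name": "get_weather",
}
http_span = {"http.request.method": "GET"}

categories = [categorize(s) for s in (llm_span, tool_span, http_span)]
print(categories)
```

Spans without any gen_ai.* attributes fall into the "other" bucket here, which reflects why conventional HTTP or database spans don’t appear as agent, LLM, or tool steps.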

The Python SDK provides one way to generate these spans. To send traces to Amazon OpenSearch Ingestion, configure the SDK with AWS Signature Version 4 (SigV4) authentication. The AWSSigV4OTLPExporter cryptographically signs each HTTP request to help prevent unauthorized data ingestion. The calling identity needs an IAM policy that grants osis:Ingest on your pipeline’s ARN. Credentials are resolved through the standard AWS credential provider chain.

from opensearch_genai_observability_sdk_py import register, AWSSigV4OTLPExporter

exporter = AWSSigV4OTLPExporter(
    endpoint="https://pipeline.us-east-1.osis.amazonaws.com/v1/traces",
    service="osis",
    region="us-east-1",
)

register(service_name="my-agent", exporter=exporter)
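The IAM permission mentioned above can be sketched as a policy document. The following example builds one as a Python dictionary and prints it as JSON. The account ID and pipeline name in the ARN are placeholders, not values from this post, so replace them with your own.

```python
import json

# Placeholder ARN: substitute your own account ID and pipeline name.
PIPELINE_ARN = "arn:aws:osis:us-east-1:111122223333:pipeline/my-pipeline"

# Minimal identity-based policy that allows ingestion into a single pipeline.
ingest_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "osis:Ingest",
            "Resource": PIPELINE_ARN,
        }
    ],
}

policy_json = json.dumps(ingest_policy, indent=2)
print(policy_json)
```

Scoping the Resource element to one pipeline ARN, rather than a wildcard, keeps the calling identity from writing to other pipelines in the account.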

Use the @observe decorator to trace agent functions and enrich() to add model metadata:

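from opensearch_genai_observability_sdk_py import observe, enrich, Op

# Trace each call to this function as a tool-execution span.
@observe(op=Op.EXECUTE_TOOL)
def get_weather(city: str) -> dict:
    return {"city": city, "temp": 22, "condition": "sunny"}

# Trace each call to this function as an agent-invocation span.
@observe(op=Op.INVOKE_AGENT)
def assistant(query: str) -> str:
    # Add model metadata to the span.
    enrich(model="gpt-4o", provider="openai")
    data = get_weather("Paris")
    return f"{data['condition']}, {data['temp']}C"

result = assistant("What's the weather?")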

The SDK also supports auto-instrumentation for OpenAI, Anthropic, Amazon Bedrock, LangChain, LlamaIndex, and others. Because the instrumentation is built on OpenTelemetry standards, any agent framework that emits spans with gen_ai.* attributes is compatible with OpenSearch UI.

Scenario 2: Investigating a microservice issue

AI agents are only one part of most production environments. The same interface surfaces telemetry from conventional microservices, where the troubleshooting workflow follows a more familiar path.

Your ecommerce checkout begins paging during a busy traffic window. From OpenSearch UI, navigate to APM Services in the left navigation pane. Every instrumented service is listed alongside its health indicators. The checkout service shows an elevated error rate.

Select the affected service. The detail view shows Request, Error, and Duration (RED) metrics: request rate is climbing, fault rate has spiked in the last 15 minutes, and p99 duration has doubled. You can see exactly when the degradation started.
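For readers who want to see how RED metrics relate to the underlying spans, here is a minimal sketch that derives request rate, error rate, and p99 duration from a window of span records. The record shape and the sample data are hypothetical, and the percentile uses the simple nearest-rank method, which may differ from how OpenSearch UI computes it.

```python
import math


def red_metrics(spans: list[dict], window_seconds: float) -> dict:
    """Compute request rate, error rate, and p99 duration for a window of spans."""
    if not spans:
        return {"rate_per_s": 0.0, "error_rate": 0.0, "p99_ms": 0.0}
    durations = sorted(s["duration_ms"] for s in spans)
    errors = sum(1 for s in spans if s["error"])
    # Nearest-rank percentile: the smallest value with at least 99% of samples at or below it.
    rank = math.ceil(0.99 * len(durations))
    return {
        "rate_per_s": len(spans) / window_seconds,
        "error_rate": errors / len(spans),
        "p99_ms": durations[rank - 1],
    }


# Hypothetical window: 200 requests in 60 seconds, of which 20 failed slowly.
healthy = [{"duration_ms": 100.0, "error": False} for _ in range(180)]
failing = [{"duration_ms": 2000.0, "error": True} for _ in range(20)]
metrics = red_metrics(healthy + failing, window_seconds=60)
print(metrics)
```

Because only 10 percent of the requests are slow, an average would understate the problem, while the p99 lands squarely on the slow requests. That is why tail percentiles are the duration signal to watch during an incident.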

Drill into the correlated spans for the affected time window. The span list shows multiple failed requests, all hitting the same endpoint. Select one to see the full trace waterfall. The checkout service called prepareOrder, which failed while trying to retrieve a product from the catalog. The error message in the span details tells you exactly what went wrong, and that is your root cause.

Checking the infrastructure with PromQL

In both scenarios, the natural next question is whether the problem originates in the application or in the infrastructure beneath it. With the new Amazon Managed Service for Prometheus integration, you can answer that question without leaving OpenSearch UI.

Prometheus metrics are now queryable directly from the same workspace using native PromQL syntax, alongside the logs and traces you’ve already been navigating.

For the catalog retrieval failure in Scenario 2, run a PromQL query to check the read/write throughput of the database instance serving the catalog for the same time window. For the agent latency issue in Scenario 1, check the LLM endpoint’s response time metrics to see if the slowness originates from the model provider.
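The two checks described above might translate into PromQL along the following lines. This sketch only assembles query strings in Python so that you can paste them into the query editor. The metric names and label values are hypothetical and depend entirely on what your exporters publish, while rate, sum, and histogram_quantile are standard PromQL functions.

```python
def throughput_query(metric: str, label: str, value: str, window: str = "5m") -> str:
    """Build a PromQL query for the per-second rate of a counter metric."""
    return f'sum(rate({metric}{{{label}="{value}"}}[{window}]))'


def p99_latency_query(metric: str, label: str, value: str, window: str = "5m") -> str:
    """Build a PromQL p99 query over a Prometheus histogram metric."""
    return (
        "histogram_quantile(0.99, "
        f'sum by (le) (rate({metric}_bucket{{{label}="{value}"}}[{window}])))'
    )


# Hypothetical metric and label names; substitute the ones your exporters emit.
db_reads = throughput_query("db_read_operations_total", "instance", "catalog-db")
llm_p99 = p99_latency_query("llm_request_duration_seconds", "provider", "openai")

print(db_reads)
print(llm_p99)
```

If the database throughput and the model endpoint latency both look normal for the incident window, the infrastructure is an unlikely cause, and the application-level traces remain the better lead.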

This is a key architectural decision: metrics continue to live in Amazon Managed Service for Prometheus, logs and traces continue to live in Amazon OpenSearch Service, and neither signal is copied or warehoused into a second store. Each backend remains the single store for the data type it’s purpose-built to handle, while OpenSearch UI federates queries across both at runtime. The cost, retention, and operational model of each store stay intact while the troubleshooting workflow collapses into a single interface.

To configure the OpenTelemetry Collector and OpenSearch Ingestion pipelines that route metrics into Amazon Managed Service for Prometheus, see Ingesting application telemetry.

How it’s wired together

The following diagram shows the end-to-end architecture. Applications instrumented with OpenTelemetry send traces, logs, and metrics over OTLP to Amazon OpenSearch Ingestion. OpenSearch Ingestion routes each signal to the appropriate store: traces and logs land in Amazon OpenSearch Service, while metrics flow into Amazon Managed Service for Prometheus. OpenSearch UI then queries both stores to render the Application Map, Services catalog, Agent Traces, and Metrics views.

The entire experience rests on open-source foundations: Prometheus for metrics, OpenSearch for logs and traces, and OpenTelemetry for instrumentation. Teams already running an OpenTelemetry collector can adopt it by updating the collector’s export configuration to point at Amazon OpenSearch Ingestion, with no proprietary agents or rewritten instrumentation required.

Getting started

To enable these capabilities, log in to OpenSearch UI’s observability workspace, select the Gear icon in the bottom left corner to open Settings and setup, and verify that the Observability:apmEnabled toggle is on under the Observability section. OpenSearch UI is available at no additional charge for Amazon OpenSearch Service customers.

Explore locally first. The OpenSearch Observability Stack gives you a fully configured environment including application monitoring, agent tracing, and Prometheus integration, running on your machine with a single install command. It ships with sample instrumented services, including a multi-agent travel planner, so you can explore the full workflow with real telemetry data out of the box.

For AI agent development. Agent Health is an open-source, evaluation-driven observability tool designed for local development. It gives you execution flow graphs, token tracking, and tool invocation visibility right in your development loop, before you push to production.

For production. The Python SDK provides one-line setup and decorator-based tracing with gen_ai semantic conventions, with auto-instrumentation support for OpenAI, Anthropic, Amazon Bedrock, LangChain, LlamaIndex, and others. See the Amazon OpenSearch Service documentation and the Amazon Managed Service for Prometheus integration guide for the full managed experience.