<?xml version="1.0" encoding="utf-8" ?>
<rss version="2.0"
     xmlns:dc="http://purl.org/dc/elements/1.1/"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">

<channel>
  <title>Planet MySQL</title>
  <link>https://planet.mysql.com</link>
  <pubDate>Thu, 25 Jun 2026 02:49:13 +0000</pubDate>
  <language>en</language>
  <description>Planet MySQL - https://planet.mysql.com</description>

  <item>
    <title>Create Replica DB system Made Easy for MySQL HeatWave Service on OCI</title>
    <guid isPermaLink="false">eb060d1aca65e1d6fa513cc1773a86a0</guid>
    <link>https://blogs.oracle.com/mysql/create-replica-db-system-made-easy-for-mysql-heatwave-service-on-oci</link>
    <description>MySQL HeatWave Service (MHS) on Oracle Cloud Infrastructure (OCI) already provides multiple ways to create a new DB system, such as restoring from a backup, using Point-in-Time Recovery (PITR), or importing data from Object Storage. However, when creating a new DB system from an existing DB system, especially in another region, the process required several sequential […]</description>
    <pubDate>Tue, 23 Jun 2026 17:37:02 +0000</pubDate>
    <dc:creator>Oracle MySQL Group</dc:creator>
    <category>MySQL</category>
    <category>MySQL HeatWave</category>
    <category>News</category>
    <category>disaster recovery</category>
    <category>mysql</category>
    <category>OCI</category>
  </item>

  <item>
    <title>Extending MySQL Capabilities with UDFs, Plugins and Components</title>
    <guid isPermaLink="false">https://ronaldbradford.com/blog/2026-06-22-extending-mysql-capabilities/</guid>
    <link>https://ronaldbradford.com/blog/2026-06-22-extending-mysql-capabilities/</link>
    <description>MySQL offers three different approaches to extending the SQL capabilities with the default product you download and install. These are:
User Defined Function (UDF) MySQL Manual MySQL Plugin MySQL Manual MySQL Component MySQL Manual For the purposes of this post I will be using the current LTS version MySQL 8.</description>
    <pubDate>Mon, 22 Jun 2026 00:00:00 +0000</pubDate>
    <dc:creator>Ronald Bradford</dc:creator>
  </item>

  <item>
    <title>Do not uselessly grant CREATE and ALTER TABLE</title>
    <guid isPermaLink="false">tag:blogger.com,1999:blog-9188714267863327820.post-4014207229171056454</guid>
    <link>https://jfg-mysql.blogspot.com/2026/06/do-not-uselessly-grant-create-and-alter-table.html</link>
    <description>This lesson should have been learned with the CREATE TABLE of death, but it is worth a refresh.

Do not uselessly grant CREATE and ALTER TABLE

The reason I am posting this reminder is that another crashing bug related to DDL came to my attention.&amp;amp;nbsp; This bug is only fixed in a recent version of MySQL (probably not affecting 5.6 and 5.7), so if you are running the latest 8.0 or 8.4, you should</description>
    <content:encoded><![CDATA[This lesson should have been learned with the CREATE TABLE of death, but it is worth a refresh.

Do not uselessly grant CREATE and ALTER TABLE

The reason I am posting this reminder is that another crashing bug related to DDL came to my attention.&amp;nbsp; This bug is only fixed in a recent version of MySQL (probably not affecting 5.6 and 5.7), so if you are running the latest 8.0 or 8.4, you should]]></content:encoded>
    <pubDate>Sat, 20 Jun 2026 22:52:52 +0000</pubDate>
    <dc:creator>Jean-François Gagné</dc:creator>
    <category>Bug</category>
    <category>Crashing Bug</category>
    <category>Data Corruption Bug</category>
    <category>MySQL</category>
  </item>

  <item>
    <title>Switching to JSON Error Logging in MySQL</title>
    <guid isPermaLink="false">https://ronaldbradford.com/blog/2026-06-19-switching-to-json-error-logging-in-mysql/</guid>
    <link>https://ronaldbradford.com/blog/2026-06-19-switching-to-json-error-logging-in-mysql/</link>
    <description>You no longer need to manually parse the MySQL Error log via scripting and RegEx pattern matching. Using the component_log_sink_json component you can obtain JSON error logging for easier parsing.</description>
    <pubDate>Fri, 19 Jun 2026 00:00:00 +0000</pubDate>
    <dc:creator>Ronald Bradford</dc:creator>
  </item>

  <item>
    <title>Installing MySQL 9.7 LTS Community Edition on CentOS</title>
    <guid isPermaLink="false">https://ronaldbradford.com/blog/2026-06-18-installing-mysql-9-7-lts-centos/</guid>
    <link>https://ronaldbradford.com/blog/2026-06-18-installing-mysql-9-7-lts-centos/</link>
    <description>Historically installing MySQL on a RedHat Compatible Linux server was as simple as yum install mysql-server.
Today’s MySQL Oracle Linux, Red Hat Enterprise Linux, CentOS, and Fedora 9.7 instructions are not accurate mixing in 8.</description>
    <pubDate>Thu, 18 Jun 2026 00:00:00 +0000</pubDate>
    <dc:creator>Ronald Bradford</dc:creator>
  </item>

  <item>
    <title>MySQL 9.7 – Thank you for your contributions!</title>
    <guid isPermaLink="false">84df1080b51ce1f302620d0b2b1c5f33</guid>
    <link>https://blogs.oracle.com/mysql/mysql-9-7-thank-you-for-your-contributions</link>
    <description>On April 21st, 2026, we released MySQL 9.7.0, the latest Long-Term Support release. As always, we are grateful to the MySQL community for helping improve MySQL with bug reports, patches, pull requests, and continued feedback. Community contributions help make MySQL better for everyone, and we are happy to recognize the contributors whose work was included […]</description>
    <pubDate>Wed, 17 Jun 2026 08:00:00 +0000</pubDate>
    <dc:creator>Oracle MySQL Group</dc:creator>
    <category>MySQL</category>
    <category>MySQL Community</category>
    <category>mysql</category>
    <category>MySQL Contributions</category>
    <category>mysqlcommunity</category>
  </item>

  <item>
    <title>A More Predictable MySQL Release Model: Calendar Versions, LTS, and Innovation</title>
    <guid isPermaLink="false">4e51fcc673c5b40f6c76cb6886390974</guid>
    <link>https://blogs.oracle.com/mysql/a-more-predictable-mysql-release-model-calendar-versions-lts-and-innovation</link>
    <description>Understanding the New Cadence: Quarterly CPUs, Targeted CSPUs, and Transitioning to Calendar Versioning MySQL is updating its release model to make releases easier to understand, plan for, and follow: The goal is not simply to change the number on a release. The goal is to give users, DBAs, developers, Linux distributions, cloud platforms, and ecosystem […]</description>
    <pubDate>Tue, 16 Jun 2026 19:46:38 +0000</pubDate>
    <dc:creator>Oracle MySQL Group</dc:creator>
    <category>MySQL</category>
    <category>MySQL Community</category>
    <category>MySQL Enterprise</category>
    <category>MySQL HeatWave</category>
    <category>News</category>
    <category>Calendar Versioning</category>
    <category>CPU</category>
    <category>CSPU</category>
    <category>mysql</category>
    <category>MySQL Innovation</category>
    <category>MySQL LTS</category>
    <category>MySQL Releases</category>
    <category>mysqlcommunity</category>
    <category>Release Planning</category>
    <category>Security</category>
    <category>Security Updates</category>
  </item>

  <item>
    <title>Extending pt-archiver with a Partition-Aware Plug-in for Fast Retention Policy Enforcement</title>
    <guid isPermaLink="false">https://www.percona.com/?p=49241</guid>
    <link>https://www.percona.com/blog/extending-pt-archiver-with-a-partition-aware-plug-in-for-fast-retention-policy-enforcement/</link>
    <description>Managing data retention policies is one of the most common operational tasks in MySQL.
Applications continuously generate transactional, audit, logging, telemetry, and event data. Over time, these tables can grow to billions of rows, causing:

Larger backups
Longer recovery times
Reduced buffer pool efficiency
Slower index maintenance
Increased storage costs
Degraded query performance

To address these problems, organizations typically implement retention policies based on dates or timestamps. Examples include deleting events older than 90 days or purging session data older than 30 days and so forth. The deleted data can then eventually be archived somewhere else, like in another DBMS or on external files.
One of the most widely used tools for implementing these policies in MySQL ecosystems is pt-archiver, part of the Percona Toolkit.
This article provides a review of what pt-archiver is and how to use it, but in particular it focuses on the fact this tool is not partitioning aware, and this can make the deletion phase more costly. The article shows how to extend pt-archiver with a Perl plugin to make it aware of partitioning.
 
What is pt-archiver?
pt-archiver is a command-line utility from Percona Toolkit designed to:

Archive rows from MySQL tables
Purge rows from MySQL tables
Move data between tables into the local database or a remote one
Export rows into files

In a few words: implementing retention policies safely.
The tool processes rows incrementally in chunks, avoiding massive transactions and reducing impact on production systems.
Example:pt-archiver \
  --source h=localhost,D=mydb,t=events \
  --where &quot;created_at &amp;amp;lt; '2026-05-01'&quot; \
  --purge \
  --limit 1000 \
  --commit-eachThis command:

Scans rows matching the WHERE condition
Processes them in chunks of 1000 rows
Commits every chunk
Deletes matching rows from the source table

pt-archiver provides several advantages compared to ad-hoc DELETE statements.
Instead of running:DELETE FROM events
WHERE created_at &amp;amp;lt; '2026-05-01';which may:

Lock rows for a long time
Generate massive undo/redo logs
Create replication lag
Exhaust transaction logs

pt-archiver processes rows incrementally to make the process overhead less impactful for the database performance.
pt-archiver implementation permits flexible archival strategies
Rows can be copied to another table on a remote host, exported to files or removed completely
More details: ps://docs.percona.com/percona-toolkit/pt-archiver.html

Example: Copy rows to a remote archive table
The following example archives rows older than 90 days from a local table into an archive table hosted on a remote MySQL server:pt-archiver \
  --source h=localhost,D=sales,t=orders,u=archiver,p=secret \
  --dest h=archive-server,D=archive,t=orders_archive,u=archiver,p=secret \
  --where &quot;created_at &amp;amp;lt; '2026-05-01'&quot; \
  --limit 1000 \
  --commit-each \
  --progress 10000 \
  --statisticsIn this example:

–source defines the source table
–dest defines the remote archive destination
–where selects rows eligible for archival
–limit controls batch size
–commit-each commits every batch independently to reduce transaction overhead

–-progress reports progress every 10,000 rows
If rows should be removed from the source table after being copied, add –purge
Example: Export rows to a file
The following example exports rows older than one year into a text file:pt-archiver \
  --source h=localhost,D=sales,t=orders,u=archiver,p=secret \
  --where &quot;created_at &amp;amp;lt; NOW() - INTERVAL 1 YEAR&quot; \
  --file '/tmp/orders_archive_%Y-%m-%d.txt' \
  --output-format csv \
  --limit 1000 \
  --commit-each \
  --progress 10000 \
  --statisticsIn this example:

–file specifies the output file
–-output-format csv exports rows in CSV format
Date placeholders in the filename are expanded automatically

Rows can optionally be deleted from the source table by adding –purge
This allows pt-archiver to be used both for data retention and for offline archival workflows.

The Hidden Cost of DELETE Statements
Although pt-archiver is much safer than massive DELETE operations, it still fundamentally relies on DELETE statements.
This is a critical point.
Even when there are proper indexes, the rows are processed in chunks, and transactions are small; the large-scale DELETE operations remain expensive.
Deleting rows is expensive in InnoDB because it involves:

Locating rows via indexes
Modifying clustered indexes
Modifying secondary indexes
Generating undo logs
Generating redo logs
Purge thread processing
Replication event generation
Page fragmentation

When deleting billions of rows, the overhead becomes enormous.
Indexes help for sure, but only partially.
Consider:DELETE FROM events
WHERE created_at &amp;amp;lt; '2024-01-01';If created_at is indexed, MySQL can efficiently locate rows.
However, locating rows efficiently is only part of the cost. The actual delete operations still require all those things we mentioned above.
At considerable scale, this becomes expensive.

Why RANGE Partitioning is Superior for Retention Policies
For time-based retention policies, partitioning is often dramatically more efficient. In particular, RANGE partitioning is very useful for these cases.
Example:CREATE TABLE events (
    id BIGINT NOT NULL,
    created_at DATETIME NOT NULL,
    payload JSON,
    PRIMARY KEY(id, created_at)
)

PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p202604 VALUES LESS THAN (TO_DAYS('2026-05-01')),
    PARTITION p202605 VALUES LESS THAN (TO_DAYS('2026-06-01')),
    PARTITION p202606 VALUES LESS THAN (TO_DAYS('2026-07-01'))
);With partitioning, dropping old data becomes:ALTER TABLE events DROP PARTITION p202604;This operation is dramatically faster than running a DELETE.
Dropping a partition:

Removes an entire physical partition
Avoids row-by-row DELETE
Avoids undo generation for each row
Avoids secondary index maintenance per row
Minimizes redo generation
Is nearly metadata-only

This can remove millions or billions of rows in a matter of seconds without the same large cost of DELETE.

The Problem: pt-archiver is Not Partition-Aware
Unfortunately, pt-archiver does not automatically understand partitioning strategies.
Even if the table is partitioned or the retention policy perfectly matches partition boundaries, pt-archiver still executes DELETE statements.
Example:pt-archiver \
  --where &quot;created_at &amp;amp;lt; NOW() - INTERVAL 90 DAY&quot; \
  --purgeInternally, this still produces DELETE … instead of ALTER TABLE … DROP PARTITION …
This means organizations may lose the major operational benefits of partitioning, or they need to implement custom scripts for managing the selection of rows to copy using pt-archiver and then use DROP PARTITION separately from the tool. That is doable, and to be honest, not too complicated, but why not make pt-archiver aware of partitioning for some specific use cases?

Extending pt-archiver with Pulg-ins
Fortunately, pt-archiver supports Perl plug-ins.
A plug-in can do plenty of things. Like: inspect runtime conditions, interact with MySQL, override behaviors, and execute custom logic
This gives us an opportunity to implement partition-aware retention handling.
The plug-in can:

Inspect partition definitions
Analyze the WHERE condition
Determine which partitions are fully expired
Execute ALTER TABLE DROP PARTITION
Prevent row-by-row DELETE processing

This approach combines the scheduling/orchestration power of pt-archiver with the efficiency of partition pruning.
Plug-in Design
Our plug-in will:

Connect using the pt-archiver DB handle
Inspect INFORMATION_SCHEMA.PARTITIONS
Identify partitions older than the retention cutoff
Issue DROP PARTITION statements
Log actions
Skip DELETE processing

Assumptions:

The table is RANGE partitioned
Partitions are DATETIME based using the TO_DAYS() function to define ranges
Partition naming convention contains dates
Retention policy aligns with partition boundaries; if the plugin cannot determine a specific boundary, pt-archiver does nothing


Full Perl Plug-in for pt-archiver
package pt_archiver_partition_drop;

use strict;
use warnings;

sub new {
    my ($class, %args) = @_;
    my $self = {
        dbh        =&amp;amp;gt; $args{dbh},
        db         =&amp;amp;gt; $args{db},
        tbl        =&amp;amp;gt; $args{tbl},
        statistics =&amp;amp;gt; {},
    };

    bless $self, $class;
    return $self;
}

sub statistics {
    my ($self) = @_;
    return $self-&amp;amp;gt;{statistics};
}


sub before_begin {
    my ($self) = @_;
    my $dbh = $self-&amp;amp;gt;{dbh} or die &quot;Missing dbh from pt-archiver\n&quot;;
    my $db  = $self-&amp;amp;gt;{db}  or die &quot;Missing db from pt-archiver plugin args\n&quot;;
    my $tbl = $self-&amp;amp;gt;{tbl} or die &quot;Missing tbl from pt-archiver plugin args\n&quot;;
    my $where  = _get_cmdline_option('where');
    my $dryrun = $ENV{PT_PARTITION_DROP_DRY_RUN} ? 1 : 0;

    die &quot;Missing --where from original command line\n&quot; unless $where;

    print &quot;PLUGIN before_begin called\n&quot;;
    print &quot;DB=$db TABLE=$tbl\n&quot;;
    print &quot;WHERE=$where\n&quot;;
    print &quot;PLUGIN_DRY_RUN=$dryrun\n&quot;;

    my ($column, $cutoff_date) = _parse_where($where);

    my $partitions = _get_partitions($dbh, $db, $tbl);

    if (!@$partitions) {
        print &quot;Table `$db`.`$tbl` is not partitioned. Refusing DELETE.\n&quot;;
        exit(0);
    }

    my $partition_expr = $partitions-&amp;amp;gt;[0]-&amp;amp;gt;{expression};
    die &quot;Missing PARTITION_EXPRESSION\n&quot;
        unless defined $partition_expr &amp;amp;amp;&amp;amp;amp; length $partition_expr;

    print &quot;Partition expression: $partition_expr\n&quot;;

    my $cutoff_value = _evaluate_cutoff(
        $dbh,
        $partition_expr,
        $column,
        $cutoff_date,
    );

    print &quot;Cutoff date: $cutoff_date\n&quot;;
    print &quot;Cutoff boundary value: $cutoff_value\n&quot;;

    my $matched;

    for my $p (@$partitions) {
        next if !defined $p-&amp;amp;gt;{description};
        next if uc($p-&amp;amp;gt;{description}) eq 'MAXVALUE';

        if ($p-&amp;amp;gt;{description} == $cutoff_value) {
            $matched = $p;
            last;
        }
    }


    if (!$matched) {
        print &quot;No exact partition boundary matches cutoff $cutoff_value. Refusing DELETE.\n&quot;;
        exit(0);
    }

    print &quot;Matched boundary partition: $matched-&amp;amp;gt;{name}, position $matched-&amp;amp;gt;{position}\n&quot;;

    my @drop;

    for my $p (@$partitions) {
        next if !defined $p-&amp;amp;gt;{description};
        next if uc($p-&amp;amp;gt;{description}) eq 'MAXVALUE';

        if ($p-&amp;amp;gt;{position} &amp;amp;lt;= $matched-&amp;amp;gt;{position}) {
            push @drop, $p-&amp;amp;gt;{name};
            print &quot;Eligible for DROP: $p-&amp;amp;gt;{name}, boundary $p-&amp;amp;gt;{description}\n&quot;;
        }
    }

    if (!@drop) {
        print &quot;No partitions eligible for DROP. Refusing DELETE.\n&quot;;
        exit(0);
    }

    my $sql = sprintf(
        &quot;ALTER TABLE %s.%s DROP PARTITION %s&quot;,
        _quote_ident($db),
        _quote_ident($tbl),
        join(&quot;, &quot;, map { _quote_ident($_) } @drop),
    );

    print &quot;SQL: $sql\n&quot;;

    if ($dryrun) {
        print &quot;PT_PARTITION_DROP_DRY_RUN enabled. Not executing DROP PARTITION.\n&quot;;
    }
    else {
        $dbh-&amp;amp;gt;do($sql);
        print &quot;Dropped partitions: &quot; . join(&quot;, &quot;, @drop) . &quot;\n&quot;;
    }

    $self-&amp;amp;gt;{statistics}-&amp;amp;gt;{partitions_dropped} = scalar @drop;

    exit(0);
}


sub _parse_where {
    my ($where) = @_;

    $where =~ s/^\s+|\s+$//g;

    die &quot;Only WHERE format supported: created_at &amp;amp;lt; 'YYYY-MM-DD'\n&quot;
        unless $where =~ /^`?([A-Za-z0-9_]+)`?\s*&amp;amp;lt;\s*'(\d{4}-\d{2}-\d{2})'\s*$/;

    return ($1, $2);
}

sub _evaluate_cutoff {
    my ($dbh, $partition_expr, $column, $cutoff_date) = @_;

    my $expr = $partition_expr;
    $expr =~ s/`//g;

    die &quot;Partition expression does not reference column `$column`: $partition_expr\n&quot;
        unless $expr =~ /\b\Q$column\E\b/i;

    $expr =~ s/\b\Q$column\E\b/'$cutoff_date'/ig;

    die &quot;Unsafe generated expression: $expr\n&quot;
        unless $expr =~ /^[A-Za-z0-9_\s\(\)\+\-\*\/,\.'&quot;:]+$/;

    my $sql = &quot;SELECT $expr&quot;;

    print &quot;Boundary evaluation SQL: $sql\n&quot;;

    my ($value) = $dbh-&amp;amp;gt;selectrow_array($sql);

    die &quot;Cannot evaluate cutoff expression: $sql\n&quot;
        unless defined $value;

    return $value;
}

sub _get_partitions {
    my ($dbh, $db, $tbl) = @_;

    my $sql = q{
        SELECT
            PARTITION_NAME,
            PARTITION_DESCRIPTION,
            PARTITION_EXPRESSION,
            PARTITION_ORDINAL_POSITION
        FROM INFORMATION_SCHEMA.PARTITIONS
        WHERE TABLE_SCHEMA = ?
          AND TABLE_NAME = ?
          AND PARTITION_NAME IS NOT NULL
        ORDER BY PARTITION_ORDINAL_POSITION
    };

    my $sth = $dbh-&amp;amp;gt;prepare($sql);
    $sth-&amp;amp;gt;execute($db, $tbl);
    my @partitions;

    while (my $row = $sth-&amp;amp;gt;fetchrow_hashref()) {
        push @partitions, {
            name        =&amp;amp;gt; $row-&amp;amp;gt;{PARTITION_NAME},
            description =&amp;amp;gt; $row-&amp;amp;gt;{PARTITION_DESCRIPTION},
            expression  =&amp;amp;gt; $row-&amp;amp;gt;{PARTITION_EXPRESSION},
            position    =&amp;amp;gt; $row-&amp;amp;gt;{PARTITION_ORDINAL_POSITION},
        };
    }

    return \@partitions;
}


sub _get_cmdline_option {

    my ($name) = @_;

    my $opt = &quot;--$name&quot;;

    for (my $i = 0; $i &amp;amp;lt; @ARGV; $i++) {
        if ($ARGV[$i] eq $opt &amp;amp;amp;&amp;amp;amp; defined $ARGV[$i + 1]) {
            return $ARGV[$i + 1];
        }

        if ($ARGV[$i] =~ /^\Q$opt\E=(.*)$/) {
            return $1;
        }
    }

    if (open my $fh, '&amp;amp;lt;', &quot;/proc/$$/cmdline&quot;) {
        local $/;
        my $raw = &amp;amp;lt;$fh&amp;amp;gt;;
        close $fh;

        my @cmd = split /\0/, $raw;

        for (my $i = 0; $i &amp;amp;lt; @cmd; $i++) {
            if ($cmd[$i] eq $opt &amp;amp;amp;&amp;amp;amp; defined $cmd[$i + 1]) {
                return $cmd[$i + 1];
            }

            if ($cmd[$i] =~ /^\Q$opt\E=(.*)$/) {
                return $1;
            }
        }
    }

    return undef;
}



sub _quote_ident {

    my ($ident) = @_;

    die &quot;Invalid identifier: $ident\n&quot;
        unless defined $ident &amp;amp;amp;&amp;amp;amp; $ident =~ /^[A-Za-z0-9_]+$/;

    return &quot;`$ident`&quot;;
}

1;Create the file named  pt_archiver_partition_drop.pm into the /usr/local/share/perl5 path.
Also set the environment variable PERL5LIB to let pt-archiver where to find the Perl packageexport PERL5LIB=/usr/local/share/perl5

Example Usage
First, create the partitioned table events and insert some fake data.DROP TABLE IF EXISTS events;


CREATE TABLE events (
  id BIGINT NOT NULL,
  created_at DATETIME NOT NULL,
  payload JSON DEFAULT NULL,
  PRIMARY KEY (id, created_at)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
  PARTITION p202604 VALUES LESS THAN (TO_DAYS('2026-05-01')),
  PARTITION p202605 VALUES LESS THAN (TO_DAYS('2026-06-01')),
  PARTITION p202606 VALUES LESS THAN (TO_DAYS('2026-07-01')),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);

INSERT INTO events (id, created_at, payload) VALUES

-- p202604
(1,  '2026-04-01 08:00:00', JSON_OBJECT('event', 'login',    'user', 'alice')),
(2,  '2026-04-03 09:15:00', JSON_OBJECT('event', 'view',     'page', 'home')),
(3,  '2026-04-05 10:30:00', JSON_OBJECT('event', 'click',    'button', 'signup')),
(4,  '2026-04-08 11:45:00', JSON_OBJECT('event', 'search',   'term', 'mysql')),
(5,  '2026-04-10 12:00:00', JSON_OBJECT('event', 'purchase', 'amount', 100)),
(6,  '2026-04-14 13:20:00', JSON_OBJECT('event', 'logout',   'user', 'alice')),
(7,  '2026-04-18 14:35:00', JSON_OBJECT('event', 'download', 'file', 'report.pdf')),
(8,  '2026-04-22 15:50:00', JSON_OBJECT('event', 'upload',   'file', 'image.png')),
(9,  '2026-04-26 16:05:00', JSON_OBJECT('event', 'click',    'button', 'buy')),
(10, '2026-04-30 23:59:59', JSON_OBJECT('event', 'month_end')),

-- p202605

(11, '2026-05-01 00:00:00', JSON_OBJECT('event', 'login',    'user', 'bob')),
(12, '2026-05-03 08:10:00', JSON_OBJECT('event', 'view',     'page', 'pricing')),
(13, '2026-05-06 09:20:00', JSON_OBJECT('event', 'search',   'term', 'percona')),
(14, '2026-05-09 10:30:00', JSON_OBJECT('event', 'purchase', 'amount', 250)),
(15, '2026-05-12 11:40:00', JSON_OBJECT('event', 'logout',   'user', 'bob')),
(16, '2026-05-16 12:50:00', JSON_OBJECT('event', 'download', 'file', 'backup.sql')),
(17, '2026-05-20 13:00:00', JSON_OBJECT('event', 'upload',   'file', 'data.csv')),
(18, '2026-05-24 14:10:00', JSON_OBJECT('event', 'click',    'button', 'subscribe')),
(19, '2026-05-28 15:20:00', JSON_OBJECT('event', 'view',     'page', 'docs')),
(20, '2026-05-31 23:59:59', JSON_OBJECT('event', 'month_end')),

-- p202606

(21, '2026-06-01 00:00:00', JSON_OBJECT('event', 'login',    'user', 'carol')),
(22, '2026-06-03 08:05:00', JSON_OBJECT('event', 'search',   'term', 'partitioning')),
(23, '2026-06-06 09:15:00', JSON_OBJECT('event', 'view',     'page', 'dashboard')),
(24, '2026-06-09 10:25:00', JSON_OBJECT('event', 'purchase', 'amount', 500)),
(25, '2026-06-12 11:35:00', JSON_OBJECT('event', 'logout',   'user', 'carol')),
(26, '2026-06-16 12:45:00', JSON_OBJECT('event', 'login',    'user', 'dave')),
(27, '2026-06-20 13:55:00', JSON_OBJECT('event', 'download', 'file', 'archive.zip')),
(28, '2026-06-24 14:05:00', JSON_OBJECT('event', 'upload',   'file', 'video.mp4')),
(29, '2026-06-28 15:15:00', JSON_OBJECT('event', 'click',    'button', 'checkout')),
(30, '2026-06-30 23:59:59', JSON_OBJECT('event', 'month_end')),

-- pmax
(31, '2026-07-01 00:00:00', JSON_OBJECT('event', 'login',    'user', 'eve')),
(32, '2026-07-05 08:30:00', JSON_OBJECT('event', 'view',     'page', 'future')),
(33, '2026-07-10 09:45:00', JSON_OBJECT('event', 'search',   'term', 'maxvalue')),
(34, '2026-08-01 10:00:00', JSON_OBJECT('event', 'purchase', 'amount', 750)),
(35, '2026-09-01 11:15:00', JSON_OBJECT('event', 'retained_future')); 
Now you can run the following command to delete all rows before the 1st of May, which, by the way, matches the entire first partition in the table.pt-archiver \
  --source h=localhost,D=mydb,t=events,m=pt_archiver_partition_drop \
  --where &quot;created_at &amp;amp;lt; '2026-05-01'&quot; \
  --purge 
Notice the Perl plugin must be indicated with the m option in the DSN string.
In practice:

pt-archiver initializes
The plug-in runs
Partitions are dropped
No DELETE statements are executed

Here is what you get from the execution of the above command:PLUGIN before_begin called
DB=mydb TABLE=events
WHERE=created_at &amp;amp;lt; '2026-05-01'
PLUGIN_DRY_RUN=0
Partition expression: to_days(`created_at`)
Boundary evaluation SQL: SELECT to_days('2026-05-01')
Cutoff date: 2026-05-01
Cutoff boundary value: 740102
Matched boundary partition: p202604, position 1
Eligible for DROP: p202604, boundary 740102
SQL: ALTER TABLE `mydb`.`events` DROP PARTITION `p202604`
Dropped partitions: p202604You can simply verify the table has been managed correctly:
SELECT * FROM mydb.events;
SHOW CREATE TABLE mydb.events;
 
Now TRUNCATE the table and recreate the data and try now to specify the where conditions that match a RANGE that is not the first in the list of the boundaries.pt-archiver \
  --source h=localhost,D=mydb,t=events,m=pt_archiver_partition_drop \
  --where &quot;created_at &amp;amp;lt; '2026-06-01'&quot; \
  --purgeYou should get:PLUGIN before_begin called
DB=mydb TABLE=events
WHERE=created_at &amp;amp;lt; '2026-06-01'
PLUGIN_DRY_RUN=0
Partition expression: to_days(`created_at`)
Boundary evaluation SQL: SELECT to_days('2026-06-01')
Cutoff date: 2026-06-01
Cutoff boundary value: 740133
Matched boundary partition: p202605, position 2
Eligible for DROP: p202604, boundary 740102
Eligible for DROP: p202605, boundary 740133
SQL: ALTER TABLE `mydb`.`events` DROP PARTITION `p202604`, `p202605`
Dropped partitions: p202604, p202605In this case, two partitions have been identified and dropped.
 
Truncate the table and recreate the data again. Try now to provide a WHERE condition that does not match any of the boundaries in the RANGE.pt-archiver \
  --source h=localhost,D=mydb,t=events,m=pt_archiver_partition_drop \
  --where &quot;created_at &amp;amp;lt; '2026-04-25'&quot; \
  --purge 
You get the following:PLUGIN before_begin called
DB=mydb TABLE=events
WHERE=created_at &amp;amp;lt; '2026-04-25'
PLUGIN_DRY_RUN=0
Partition expression: to_days(`created_at`)
Boundary evaluation SQL: SELECT to_days('2026-04-25')
Cutoff date: 2026-04-25
Cutoff boundary value: 740096
No exact partition boundary matches cutoff 740096. Refusing DELETE.As expected, the tool now refuses to execute anything if it doesn’t find an exact match.
 
Operational Benefits
This approach provides major advantages.
Dropping partitions is vastly faster than deleting rows, and minimal binary logging is needed, compared to billions of row deletes. There is no massive transactional overhead for managing undo logs and purging. You get then a better InnoDB Buffer Pool stability because of less page churn.
In the end, retention jobs are completed quickly and consistently in a predictable way and at the minimal cost.
 
Important Caveats
Partition Boundaries Must Match Retention Policy
If partitions contain mixed retention windows, DROP PARTITION may remove too much data. For this reason, ensure correct partition design.
Recommended:

daily partitions
weekly partitions
monthly partitions

aligned with business retention requirements.
Metadata Locks
ALTER TABLE DROP PARTITION still acquires metadata locks.
Test carefully in production.
Backup Awareness
Ensure dropped partitions are no longer needed before removal or use pt-archiver to also copy the data into a remote server or dump the data into a CSV file before running the DROP PARTITION.
 
Possible Enhancements
The plug-in can be extended further.
Potential improvements:

Support for daily partitions
Support for UNIX timestamp partitions
Dry-run reporting
Automatic partition creation
Push Slack notifications
Export Prometheus metrics
Safety checks for replicas
GTID-aware orchestration
Integration with pt-online-schema-change workflows

These are just some ideas I had meanwhile doing my tests. What you can do by implementing a Perl plugin is only limited by your imagination and your real needs.

Conclusion
pt-archiver remains an excellent tool for implementing retention policies and archival workflows.
However, DELETE-based purging becomes increasingly expensive at scale, even with proper indexing and chunked processing.
For large time-series or historical datasets, RANGE partitioning is often a dramatically superior strategy.
The challenge is that pt-archiver does not natively leverage partition-level operations.
Fortunately, its Perl plug-in architecture allows advanced users to extend its behavior and implement partition-aware cleanup logic.
By combining:

pt-archiver orchestration
MySQL RANGE partitioning
Custom Perl plug-ins

Organizations can achieve:

Faster retention enforcement
Lower operational overhead
Smaller replication impact
Dramatically improved scalability

For large MySQL deployments, this hybrid approach can turn multi-hour purge operations into near-instant metadata operations.
The use case presented in this article is limited to a specific scenario, but you can reuse it or customize it if you have a different kind of RANGE partitioning, for example, not using TO_DAYS().
Take this as just an example of how you can extend pt-archiver. What you can do for real is driven by your needs and/or only limited by your imagination.
More info about extending pt-archiver:
https://docs.percona.com/percona-toolkit/pt-archiver.html#extending
 
The post Extending pt-archiver with a Partition-Aware Plug-in for Fast Retention Policy Enforcement appeared first on Percona.</description>
    <content:encoded><![CDATA[<p>Managing data retention policies is one of the most common operational tasks in MySQL.</p>
<p>Applications continuously generate transactional, audit, logging, telemetry, and event data. Over time, these tables can grow to billions of rows, causing:</p>
<ul>
<li aria-level="1">Larger backups</li>
<li aria-level="1">Longer recovery times</li>
<li aria-level="1">Reduced buffer pool efficiency</li>
<li aria-level="1">Slower index maintenance</li>
<li aria-level="1">Increased storage costs</li>
<li aria-level="1">Degraded query performance</li>
</ul>
<p>To address these problems, organizations typically implement retention policies based on dates or timestamps. Examples include deleting events older than 90 days or purging session data older than 30 days and so forth. The deleted data can then eventually be archived somewhere else, like in another DBMS or on external files.</p>
<p>One of the most widely used tools for implementing these policies in MySQL ecosystems is pt-archiver, part of the Percona Toolkit.</p>
<p>This article provides a review of what pt-archiver is and how to use it, but in particular it focuses on the fact this tool is not partitioning aware, and this can make the deletion phase more costly. The article shows how to extend pt-archiver with a Perl plugin to make it aware of partitioning.</p>
<p> </p>
<h2>What is pt-archiver?</h2>
<p>pt-archiver is a command-line utility from Percona Toolkit designed to:</p>
<ul>
<li aria-level="1">Archive rows from MySQL tables</li>
<li aria-level="1">Purge rows from MySQL tables</li>
<li aria-level="1">Move data between tables into the local database or a remote one</li>
<li aria-level="1">Export rows into files</li>
</ul>
<p>In a few words: implementing retention policies safely.</p>
<p>The tool processes rows incrementally in chunks, avoiding massive transactions and reducing impact on production systems.</p>
<p>Example:</p><pre class="urvanov-syntax-highlighter-plain-tag">pt-archiver \
  --source h=localhost,D=mydb,t=events \
  --where "created_at &amp;lt; '2026-05-01'" \
  --purge \
  --limit 1000 \
  --commit-each</pre><p>This command:</p>
<ul>
<li aria-level="1">Scans rows matching the WHERE condition</li>
<li aria-level="1">Processes them in chunks of 1000 rows</li>
<li aria-level="1">Commits every chunk</li>
<li aria-level="1">Deletes matching rows from the source table</li>
</ul>
<p>pt-archiver provides several advantages compared to ad-hoc DELETE statements.</p>
<p>Instead of running:</p><pre class="urvanov-syntax-highlighter-plain-tag">DELETE FROM events
WHERE created_at &amp;lt; '2026-05-01';</pre><p>which may:</p>
<ul>
<li aria-level="1">Lock rows for a long time</li>
<li aria-level="1">Generate massive undo/redo logs</li>
<li aria-level="1">Create replication lag</li>
<li aria-level="1">Exhaust transaction logs</li>
</ul>
<p>pt-archiver processes rows incrementally to make the process overhead less impactful for the database performance.</p>
<p>pt-archiver implementation permits flexible archival strategies</p>
<p>Rows can be copied to another table on a remote host, exported to files or removed completely</p>
<p>More details: <a href="https://docs.percona.com/percona-toolkit/pt-archiver.html#extending">ps://docs.percona.com/percona-toolkit/pt-archiver.html</a></p>
<h2></h2>
<h3>Example: Copy rows to a remote archive table</h3>
<p>The following example archives rows older than 90 days from a local table into an archive table hosted on a remote MySQL server:</p><pre class="urvanov-syntax-highlighter-plain-tag">pt-archiver \
  --source h=localhost,D=sales,t=orders,u=archiver,p=secret \
  --dest h=archive-server,D=archive,t=orders_archive,u=archiver,p=secret \
  --where "created_at &amp;lt; '2026-05-01'" \
  --limit 1000 \
  --commit-each \
  --progress 10000 \
  --statistics</pre><p>In this example:</p>
<ul>
<li aria-level="1"><span>–source</span> defines the source table</li>
<li aria-level="1"><span>–dest</span> defines the remote archive destination</li>
<li aria-level="1"><span>–where</span> selects rows eligible for archival</li>
<li aria-level="1"><span>–limit</span> controls batch size</li>
<li aria-level="1"><span>–commit-each</span> commits every batch independently to reduce transaction overhead</li>
</ul>
<p>–<span>-progress</span> reports progress every 10,000 rows</p>
<p>If rows should be removed from the source table after being copied, add <span>–purge</span></p>
<h3>Example: Export rows to a file</h3>
<p>The following example exports rows older than one year into a text file:</p><pre class="urvanov-syntax-highlighter-plain-tag">pt-archiver \
  --source h=localhost,D=sales,t=orders,u=archiver,p=secret \
  --where "created_at &amp;lt; NOW() - INTERVAL 1 YEAR" \
  --file '/tmp/orders_archive_%Y-%m-%d.txt' \
  --output-format csv \
  --limit 1000 \
  --commit-each \
  --progress 10000 \
  --statistics</pre><p>In this example:</p>
<ul>
<li aria-level="1"><span>–file</span> specifies the output file</li>
<li aria-level="1">–<span>-output-format csv</span> exports rows in CSV format</li>
<li aria-level="1">Date placeholders in the filename are expanded automatically</li>
</ul>
<p>Rows can optionally be deleted from the source table by adding <span>–purge</span></p>
<p>This allows pt-archiver to be used both for data retention and for offline archival workflows.</p>
<h1></h1>
<h2>The Hidden Cost of DELETE Statements</h2>
<p>Although pt-archiver is much safer than massive DELETE operations, it still fundamentally relies on DELETE statements.</p>
<p>This is a critical point.</p>
<p>Even when there are proper indexes, the rows are processed in chunks, and transactions are small; the large-scale DELETE operations remain expensive.</p>
<p>Deleting rows is expensive in InnoDB because it involves:</p>
<ul>
<li aria-level="1">Locating rows via indexes</li>
<li aria-level="1">Modifying clustered indexes</li>
<li aria-level="1">Modifying secondary indexes</li>
<li aria-level="1">Generating undo logs</li>
<li aria-level="1">Generating redo logs</li>
<li aria-level="1">Purge thread processing</li>
<li aria-level="1">Replication event generation</li>
<li aria-level="1">Page fragmentation</li>
</ul>
<p>When deleting billions of rows, the overhead becomes enormous.</p>
<p>Indexes help for sure, but only partially.</p>
<p>Consider:</p><pre class="urvanov-syntax-highlighter-plain-tag">DELETE FROM events
WHERE created_at &amp;lt; '2024-01-01';</pre><p>If <span>created_at</span> is indexed, MySQL can efficiently locate rows.</p>
<p>However, locating rows efficiently is only part of the cost. The actual delete operations still require all those things we mentioned above.</p>
<p>At considerable scale, this becomes expensive.</p>
<h1></h1>
<h2>Why RANGE Partitioning is Superior for Retention Policies</h2>
<p>For time-based retention policies, partitioning is often dramatically more efficient. In particular, RANGE partitioning is very useful for these cases.</p>
<p>Example:</p><pre class="urvanov-syntax-highlighter-plain-tag">CREATE TABLE events (
    id BIGINT NOT NULL,
    created_at DATETIME NOT NULL,
    payload JSON,
    PRIMARY KEY(id, created_at)
)

PARTITION BY RANGE (TO_DAYS(created_at)) (
    PARTITION p202604 VALUES LESS THAN (TO_DAYS('2026-05-01')),
    PARTITION p202605 VALUES LESS THAN (TO_DAYS('2026-06-01')),
    PARTITION p202606 VALUES LESS THAN (TO_DAYS('2026-07-01'))
);</pre><p>With partitioning, dropping old data becomes:</p><pre class="urvanov-syntax-highlighter-plain-tag">ALTER TABLE events DROP PARTITION p202604;</pre><p>This operation is dramatically faster than running a DELETE.</p>
<p>Dropping a partition:</p>
<ul>
<li aria-level="1">Removes an entire physical partition</li>
<li aria-level="1">Avoids row-by-row DELETE</li>
<li aria-level="1">Avoids undo generation for each row</li>
<li aria-level="1">Avoids secondary index maintenance per row</li>
<li aria-level="1">Minimizes redo generation</li>
<li aria-level="1">Is nearly metadata-only</li>
</ul>
<p>This can remove millions or billions of rows in a matter of seconds without the same large cost of DELETE.</p>
<h1></h1>
<h2>The Problem: pt-archiver is Not Partition-Aware</h2>
<p>Unfortunately, pt-archiver does not automatically understand partitioning strategies.</p>
<p>Even if the table is partitioned or the retention policy perfectly matches partition boundaries, pt-archiver still executes DELETE statements.</p>
<p>Example:</p><pre class="urvanov-syntax-highlighter-plain-tag">pt-archiver \
  --where "created_at &amp;lt; NOW() - INTERVAL 90 DAY" \
  --purge</pre><p>Internally, this still produces <strong><span>DELETE …</span></strong> instead of <strong><span>ALTER TABLE … DROP PARTITION …</span></strong></p>
<p>This means organizations may lose the major operational benefits of partitioning, or they need to implement custom scripts for managing the selection of rows to copy using pt-archiver and then use DROP PARTITION separately from the tool. That is doable, and to be honest, not too complicated, but why not make pt-archiver aware of partitioning for some specific use cases?</p>
<h1></h1>
<h2>Extending pt-archiver with Pulg-ins</h2>
<p>Fortunately, pt-archiver supports Perl plug-ins.</p>
<p>A plug-in can do plenty of things. Like: inspect runtime conditions, interact with MySQL, override behaviors, and execute custom logic</p>
<p>This gives us an opportunity to implement partition-aware retention handling.</p>
<p>The plug-in can:</p>
<ol>
<li aria-level="1">Inspect partition definitions</li>
<li aria-level="1">Analyze the WHERE condition</li>
<li aria-level="1">Determine which partitions are fully expired</li>
<li aria-level="1">Execute ALTER TABLE DROP PARTITION</li>
<li aria-level="1">Prevent row-by-row DELETE processing</li>
</ol>
<p>This approach combines the scheduling/orchestration power of pt-archiver with the efficiency of partition pruning.</p>
<h3>Plug-in Design</h3>
<p>Our plug-in will:</p>
<ul>
<li aria-level="1">Connect using the pt-archiver DB handle</li>
<li aria-level="1">Inspect INFORMATION_SCHEMA.PARTITIONS</li>
<li aria-level="1">Identify partitions older than the retention cutoff</li>
<li aria-level="1">Issue DROP PARTITION statements</li>
<li aria-level="1">Log actions</li>
<li aria-level="1">Skip DELETE processing</li>
</ul>
<p>Assumptions:</p>
<ul>
<li aria-level="1">The table is RANGE partitioned</li>
<li aria-level="1">Partitions are DATETIME based using the TO_DAYS() function to define ranges</li>
<li aria-level="1">Partition naming convention contains dates</li>
<li aria-level="1">Retention policy aligns with partition boundaries; if the plugin cannot determine a specific boundary, pt-archiver does nothing</li>
</ul>
<h1></h1>
<h2>Full Perl Plug-in for pt-archiver</h2>
<p></p><pre class="urvanov-syntax-highlighter-plain-tag">package pt_archiver_partition_drop;

use strict;
use warnings;

sub new {
    my ($class, %args) = @_;
    my $self = {
        dbh        =&amp;gt; $args{dbh},
        db         =&amp;gt; $args{db},
        tbl        =&amp;gt; $args{tbl},
        statistics =&amp;gt; {},
    };

    bless $self, $class;
    return $self;
}

sub statistics {
    my ($self) = @_;
    return $self-&amp;gt;{statistics};
}


sub before_begin {
    my ($self) = @_;
    my $dbh = $self-&amp;gt;{dbh} or die "Missing dbh from pt-archiver\n";
    my $db  = $self-&amp;gt;{db}  or die "Missing db from pt-archiver plugin args\n";
    my $tbl = $self-&amp;gt;{tbl} or die "Missing tbl from pt-archiver plugin args\n";
    my $where  = _get_cmdline_option('where');
    my $dryrun = $ENV{PT_PARTITION_DROP_DRY_RUN} ? 1 : 0;

    die "Missing --where from original command line\n" unless $where;

    print "PLUGIN before_begin called\n";
    print "DB=$db TABLE=$tbl\n";
    print "WHERE=$where\n";
    print "PLUGIN_DRY_RUN=$dryrun\n";

    my ($column, $cutoff_date) = _parse_where($where);

    my $partitions = _get_partitions($dbh, $db, $tbl);

    if (!@$partitions) {
        print "Table `$db`.`$tbl` is not partitioned. Refusing DELETE.\n";
        exit(0);
    }

    my $partition_expr = $partitions-&amp;gt;[0]-&amp;gt;{expression};
    die "Missing PARTITION_EXPRESSION\n"
        unless defined $partition_expr &amp;amp;&amp;amp; length $partition_expr;

    print "Partition expression: $partition_expr\n";

    my $cutoff_value = _evaluate_cutoff(
        $dbh,
        $partition_expr,
        $column,
        $cutoff_date,
    );

    print "Cutoff date: $cutoff_date\n";
    print "Cutoff boundary value: $cutoff_value\n";

    my $matched;

    for my $p (@$partitions) {
        next if !defined $p-&amp;gt;{description};
        next if uc($p-&amp;gt;{description}) eq 'MAXVALUE';

        if ($p-&amp;gt;{description} == $cutoff_value) {
            $matched = $p;
            last;
        }
    }


    if (!$matched) {
        print "No exact partition boundary matches cutoff $cutoff_value. Refusing DELETE.\n";
        exit(0);
    }

    print "Matched boundary partition: $matched-&amp;gt;{name}, position $matched-&amp;gt;{position}\n";

    my @drop;

    for my $p (@$partitions) {
        next if !defined $p-&amp;gt;{description};
        next if uc($p-&amp;gt;{description}) eq 'MAXVALUE';

        if ($p-&amp;gt;{position} &amp;lt;= $matched-&amp;gt;{position}) {
            push @drop, $p-&amp;gt;{name};
            print "Eligible for DROP: $p-&amp;gt;{name}, boundary $p-&amp;gt;{description}\n";
        }
    }

    if (!@drop) {
        print "No partitions eligible for DROP. Refusing DELETE.\n";
        exit(0);
    }

    my $sql = sprintf(
        "ALTER TABLE %s.%s DROP PARTITION %s",
        _quote_ident($db),
        _quote_ident($tbl),
        join(", ", map { _quote_ident($_) } @drop),
    );

    print "SQL: $sql\n";

    if ($dryrun) {
        print "PT_PARTITION_DROP_DRY_RUN enabled. Not executing DROP PARTITION.\n";
    }
    else {
        $dbh-&amp;gt;do($sql);
        print "Dropped partitions: " . join(", ", @drop) . "\n";
    }

    $self-&amp;gt;{statistics}-&amp;gt;{partitions_dropped} = scalar @drop;

    exit(0);
}


sub _parse_where {
    my ($where) = @_;

    $where =~ s/^\s+|\s+$//g;

    die "Only WHERE format supported: created_at &amp;lt; 'YYYY-MM-DD'\n"
        unless $where =~ /^`?([A-Za-z0-9_]+)`?\s*&amp;lt;\s*'(\d{4}-\d{2}-\d{2})'\s*$/;

    return ($1, $2);
}

sub _evaluate_cutoff {
    my ($dbh, $partition_expr, $column, $cutoff_date) = @_;

    my $expr = $partition_expr;
    $expr =~ s/`//g;

    die "Partition expression does not reference column `$column`: $partition_expr\n"
        unless $expr =~ /\b\Q$column\E\b/i;

    $expr =~ s/\b\Q$column\E\b/'$cutoff_date'/ig;

    die "Unsafe generated expression: $expr\n"
        unless $expr =~ /^[A-Za-z0-9_\s\(\)\+\-\*\/,\.'":]+$/;

    my $sql = "SELECT $expr";

    print "Boundary evaluation SQL: $sql\n";

    my ($value) = $dbh-&amp;gt;selectrow_array($sql);

    die "Cannot evaluate cutoff expression: $sql\n"
        unless defined $value;

    return $value;
}

sub _get_partitions {
    my ($dbh, $db, $tbl) = @_;

    my $sql = q{
        SELECT
            PARTITION_NAME,
            PARTITION_DESCRIPTION,
            PARTITION_EXPRESSION,
            PARTITION_ORDINAL_POSITION
        FROM INFORMATION_SCHEMA.PARTITIONS
        WHERE TABLE_SCHEMA = ?
          AND TABLE_NAME = ?
          AND PARTITION_NAME IS NOT NULL
        ORDER BY PARTITION_ORDINAL_POSITION
    };

    my $sth = $dbh-&amp;gt;prepare($sql);
    $sth-&amp;gt;execute($db, $tbl);
    my @partitions;

    while (my $row = $sth-&amp;gt;fetchrow_hashref()) {
        push @partitions, {
            name        =&amp;gt; $row-&amp;gt;{PARTITION_NAME},
            description =&amp;gt; $row-&amp;gt;{PARTITION_DESCRIPTION},
            expression  =&amp;gt; $row-&amp;gt;{PARTITION_EXPRESSION},
            position    =&amp;gt; $row-&amp;gt;{PARTITION_ORDINAL_POSITION},
        };
    }

    return \@partitions;
}


sub _get_cmdline_option {

    my ($name) = @_;

    my $opt = "--$name";

    for (my $i = 0; $i &amp;lt; @ARGV; $i++) {
        if ($ARGV[$i] eq $opt &amp;amp;&amp;amp; defined $ARGV[$i + 1]) {
            return $ARGV[$i + 1];
        }

        if ($ARGV[$i] =~ /^\Q$opt\E=(.*)$/) {
            return $1;
        }
    }

    if (open my $fh, '&amp;lt;', "/proc/$$/cmdline") {
        local $/;
        my $raw = &amp;lt;$fh&amp;gt;;
        close $fh;

        my @cmd = split /\0/, $raw;

        for (my $i = 0; $i &amp;lt; @cmd; $i++) {
            if ($cmd[$i] eq $opt &amp;amp;&amp;amp; defined $cmd[$i + 1]) {
                return $cmd[$i + 1];
            }

            if ($cmd[$i] =~ /^\Q$opt\E=(.*)$/) {
                return $1;
            }
        }
    }

    return undef;
}



sub _quote_ident {

    my ($ident) = @_;

    die "Invalid identifier: $ident\n"
        unless defined $ident &amp;amp;&amp;amp; $ident =~ /^[A-Za-z0-9_]+$/;

    return "`$ident`";
}

1;</pre><p>Create the file named  <b>pt_archiver_partition_drop.pm</b> into the <b>/usr/local/share/perl5</b> path.</p>
<p>Also set the environment variable <b>PERL5LIB</b> to let pt-archiver where to find the Perl package</p><pre class="urvanov-syntax-highlighter-plain-tag">export PERL5LIB=/usr/local/share/perl5</pre><p></p>
<h1></h1>
<h2>Example Usage</h2>
<p>First, create the partitioned table events and insert some fake data.</p><pre class="urvanov-syntax-highlighter-plain-tag">DROP TABLE IF EXISTS events;


CREATE TABLE events (
  id BIGINT NOT NULL,
  created_at DATETIME NOT NULL,
  payload JSON DEFAULT NULL,
  PRIMARY KEY (id, created_at)
)
PARTITION BY RANGE (TO_DAYS(created_at)) (
  PARTITION p202604 VALUES LESS THAN (TO_DAYS('2026-05-01')),
  PARTITION p202605 VALUES LESS THAN (TO_DAYS('2026-06-01')),
  PARTITION p202606 VALUES LESS THAN (TO_DAYS('2026-07-01')),
  PARTITION pmax VALUES LESS THAN MAXVALUE
);

INSERT INTO events (id, created_at, payload) VALUES

-- p202604
(1,  '2026-04-01 08:00:00', JSON_OBJECT('event', 'login',    'user', 'alice')),
(2,  '2026-04-03 09:15:00', JSON_OBJECT('event', 'view',     'page', 'home')),
(3,  '2026-04-05 10:30:00', JSON_OBJECT('event', 'click',    'button', 'signup')),
(4,  '2026-04-08 11:45:00', JSON_OBJECT('event', 'search',   'term', 'mysql')),
(5,  '2026-04-10 12:00:00', JSON_OBJECT('event', 'purchase', 'amount', 100)),
(6,  '2026-04-14 13:20:00', JSON_OBJECT('event', 'logout',   'user', 'alice')),
(7,  '2026-04-18 14:35:00', JSON_OBJECT('event', 'download', 'file', 'report.pdf')),
(8,  '2026-04-22 15:50:00', JSON_OBJECT('event', 'upload',   'file', 'image.png')),
(9,  '2026-04-26 16:05:00', JSON_OBJECT('event', 'click',    'button', 'buy')),
(10, '2026-04-30 23:59:59', JSON_OBJECT('event', 'month_end')),

-- p202605

(11, '2026-05-01 00:00:00', JSON_OBJECT('event', 'login',    'user', 'bob')),
(12, '2026-05-03 08:10:00', JSON_OBJECT('event', 'view',     'page', 'pricing')),
(13, '2026-05-06 09:20:00', JSON_OBJECT('event', 'search',   'term', 'percona')),
(14, '2026-05-09 10:30:00', JSON_OBJECT('event', 'purchase', 'amount', 250)),
(15, '2026-05-12 11:40:00', JSON_OBJECT('event', 'logout',   'user', 'bob')),
(16, '2026-05-16 12:50:00', JSON_OBJECT('event', 'download', 'file', 'backup.sql')),
(17, '2026-05-20 13:00:00', JSON_OBJECT('event', 'upload',   'file', 'data.csv')),
(18, '2026-05-24 14:10:00', JSON_OBJECT('event', 'click',    'button', 'subscribe')),
(19, '2026-05-28 15:20:00', JSON_OBJECT('event', 'view',     'page', 'docs')),
(20, '2026-05-31 23:59:59', JSON_OBJECT('event', 'month_end')),

-- p202606

(21, '2026-06-01 00:00:00', JSON_OBJECT('event', 'login',    'user', 'carol')),
(22, '2026-06-03 08:05:00', JSON_OBJECT('event', 'search',   'term', 'partitioning')),
(23, '2026-06-06 09:15:00', JSON_OBJECT('event', 'view',     'page', 'dashboard')),
(24, '2026-06-09 10:25:00', JSON_OBJECT('event', 'purchase', 'amount', 500)),
(25, '2026-06-12 11:35:00', JSON_OBJECT('event', 'logout',   'user', 'carol')),
(26, '2026-06-16 12:45:00', JSON_OBJECT('event', 'login',    'user', 'dave')),
(27, '2026-06-20 13:55:00', JSON_OBJECT('event', 'download', 'file', 'archive.zip')),
(28, '2026-06-24 14:05:00', JSON_OBJECT('event', 'upload',   'file', 'video.mp4')),
(29, '2026-06-28 15:15:00', JSON_OBJECT('event', 'click',    'button', 'checkout')),
(30, '2026-06-30 23:59:59', JSON_OBJECT('event', 'month_end')),

-- pmax
(31, '2026-07-01 00:00:00', JSON_OBJECT('event', 'login',    'user', 'eve')),
(32, '2026-07-05 08:30:00', JSON_OBJECT('event', 'view',     'page', 'future')),
(33, '2026-07-10 09:45:00', JSON_OBJECT('event', 'search',   'term', 'maxvalue')),
(34, '2026-08-01 10:00:00', JSON_OBJECT('event', 'purchase', 'amount', 750)),
(35, '2026-09-01 11:15:00', JSON_OBJECT('event', 'retained_future'));</pre><p> </p>
<p>Now you can run the following command to delete all rows before the 1st of May, which, by the way, matches the entire first partition in the table.</p><pre class="urvanov-syntax-highlighter-plain-tag">pt-archiver \
  --source h=localhost,D=mydb,t=events,m=pt_archiver_partition_drop \
  --where "created_at &amp;lt; '2026-05-01'" \
  --purge</pre><p> </p>
<p>Notice the Perl plugin must be indicated with the <b>m</b> option in the DSN string.</p>
<p>In practice:</p>
<ul>
<li aria-level="1">pt-archiver initializes</li>
<li aria-level="1">The plug-in runs</li>
<li aria-level="1">Partitions are dropped</li>
<li aria-level="1">No DELETE statements are executed</li>
</ul>
<p>Here is what you get from the execution of the above command:</p><pre class="urvanov-syntax-highlighter-plain-tag">PLUGIN before_begin called
DB=mydb TABLE=events
WHERE=created_at &amp;lt; '2026-05-01'
PLUGIN_DRY_RUN=0
Partition expression: to_days(`created_at`)
Boundary evaluation SQL: SELECT to_days('2026-05-01')
Cutoff date: 2026-05-01
Cutoff boundary value: 740102
Matched boundary partition: p202604, position 1
Eligible for DROP: p202604, boundary 740102
SQL: ALTER TABLE `mydb`.`events` DROP PARTITION `p202604`
Dropped partitions: p202604</pre><p>You can simply verify the table has been managed correctly:</p>
<p><span>SELECT * FROM mydb.events;</span></p>
<p><span>SHOW CREATE TABLE mydb.events;</span></p>
<p> </p>
<p>Now TRUNCATE the table and recreate the data and try now to specify the where conditions that match a RANGE that is not the first in the list of the boundaries.</p><pre class="urvanov-syntax-highlighter-plain-tag">pt-archiver \
  --source h=localhost,D=mydb,t=events,m=pt_archiver_partition_drop \
  --where "created_at &amp;lt; '2026-06-01'" \
  --purge</pre><p>You should get:</p><pre class="urvanov-syntax-highlighter-plain-tag">PLUGIN before_begin called
DB=mydb TABLE=events
WHERE=created_at &amp;lt; '2026-06-01'
PLUGIN_DRY_RUN=0
Partition expression: to_days(`created_at`)
Boundary evaluation SQL: SELECT to_days('2026-06-01')
Cutoff date: 2026-06-01
Cutoff boundary value: 740133
Matched boundary partition: p202605, position 2
Eligible for DROP: p202604, boundary 740102
Eligible for DROP: p202605, boundary 740133
SQL: ALTER TABLE `mydb`.`events` DROP PARTITION `p202604`, `p202605`
Dropped partitions: p202604, p202605</pre><p>In this case, two partitions have been identified and dropped.</p>
<p> </p>
<p>Truncate the table and recreate the data again. Try now to provide a WHERE condition that does not match any of the boundaries in the RANGE.</p><pre class="urvanov-syntax-highlighter-plain-tag">pt-archiver \
  --source h=localhost,D=mydb,t=events,m=pt_archiver_partition_drop \
  --where "created_at &amp;lt; '2026-04-25'" \
  --purge</pre><p> </p>
<p>You get the following:</p><pre class="urvanov-syntax-highlighter-plain-tag">PLUGIN before_begin called
DB=mydb TABLE=events
WHERE=created_at &amp;lt; '2026-04-25'
PLUGIN_DRY_RUN=0
Partition expression: to_days(`created_at`)
Boundary evaluation SQL: SELECT to_days('2026-04-25')
Cutoff date: 2026-04-25
Cutoff boundary value: 740096
No exact partition boundary matches cutoff 740096. Refusing DELETE.</pre><p>As expected, the tool now refuses to execute anything if it doesn’t find an exact match.</p>
<p> </p>
<h2>Operational Benefits</h2>
<p>This approach provides major advantages.</p>
<p>Dropping partitions is vastly faster than deleting rows, and minimal binary logging is needed, compared to billions of row deletes. There is no massive transactional overhead for managing undo logs and purging. You get then a better InnoDB Buffer Pool stability because of less page churn.</p>
<p>In the end, retention jobs are completed quickly and consistently in a predictable way and at the minimal cost.</p>
<p> </p>
<h2>Important Caveats</h2>
<h3>Partition Boundaries Must Match Retention Policy</h3>
<p>If partitions contain mixed retention windows, DROP PARTITION may remove too much data. For this reason, ensure correct partition design.</p>
<p>Recommended:</p>
<ul>
<li aria-level="1">daily partitions</li>
<li aria-level="1">weekly partitions</li>
<li aria-level="1">monthly partitions</li>
</ul>
<p>aligned with business retention requirements.</p>
<h3>Metadata Locks</h3>
<p><span>ALTER TABLE DROP PARTITION</span> still acquires metadata locks.</p>
<p>Test carefully in production.</p>
<h3>Backup Awareness</h3>
<p>Ensure dropped partitions are no longer needed before removal or use pt-archiver to also copy the data into a remote server or dump the data into a CSV file before running the DROP PARTITION.</p>
<p> </p>
<h2>Possible Enhancements</h2>
<p>The plug-in can be extended further.</p>
<p>Potential improvements:</p>
<ul>
<li aria-level="1">Support for daily partitions</li>
<li aria-level="1">Support for UNIX timestamp partitions</li>
<li aria-level="1">Dry-run reporting</li>
<li aria-level="1">Automatic partition creation</li>
<li aria-level="1">Push Slack notifications</li>
<li aria-level="1">Export Prometheus metrics</li>
<li aria-level="1">Safety checks for replicas</li>
<li aria-level="1">GTID-aware orchestration</li>
<li aria-level="1">Integration with pt-online-schema-change workflows</li>
</ul>
<p>These are just some ideas I had meanwhile doing my tests. What you can do by implementing a Perl plugin is only limited by your imagination and your real needs.</p>
<h1></h1>
<h2>Conclusion</h2>
<p>pt-archiver remains an excellent tool for implementing retention policies and archival workflows.</p>
<p>However, DELETE-based purging becomes increasingly expensive at scale, even with proper indexing and chunked processing.</p>
<p>For large time-series or historical datasets, RANGE partitioning is often a dramatically superior strategy.</p>
<p>The challenge is that pt-archiver does not natively leverage partition-level operations.</p>
<p>Fortunately, its Perl plug-in architecture allows advanced users to extend its behavior and implement partition-aware cleanup logic.</p>
<p>By combining:</p>
<ul>
<li aria-level="1">pt-archiver orchestration</li>
<li aria-level="1">MySQL RANGE partitioning</li>
<li aria-level="1">Custom Perl plug-ins</li>
</ul>
<p>Organizations can achieve:</p>
<ul>
<li aria-level="1">Faster retention enforcement</li>
<li aria-level="1">Lower operational overhead</li>
<li aria-level="1">Smaller replication impact</li>
<li aria-level="1">Dramatically improved scalability</li>
</ul>
<p>For large MySQL deployments, this hybrid approach can turn multi-hour purge operations into near-instant metadata operations.</p>
<p>The use case presented in this article is limited to a specific scenario, but you can reuse it or customize it if you have a different kind of RANGE partitioning, for example, not using TO_DAYS().</p>
<p>Take this as just an example of how you can extend pt-archiver. What you can do for real is driven by your needs and/or only limited by your imagination.</p>
<p>More info about extending pt-archiver:<br>
<a href="https://docs.percona.com/percona-toolkit/pt-archiver.html#extending">https://docs.percona.com/percona-toolkit/pt-archiver.html#extending</a></p>
<p> </p>
<p>The post <a href="https://www.percona.com/blog/extending-pt-archiver-with-a-partition-aware-plug-in-for-fast-retention-policy-enforcement/">Extending pt-archiver with a Partition-Aware Plug-in for Fast Retention Policy Enforcement</a> appeared first on <a href="https://www.percona.com/">Percona</a>.</p>]]></content:encoded>
    <pubDate>Tue, 16 Jun 2026 11:31:51 +0000</pubDate>
    <dc:creator>Corrado Pandiani</dc:creator>
    <category>Insight for DBAs</category>
    <category>Insight for Developers</category>
    <category>MySQL</category>
    <category>Open Source</category>
    <category>Percona Software</category>
  </item>

  <item>
    <title>Using GenAI directly in the database. A practical example using MySQL 8.0</title>
    <guid isPermaLink="false">https://ronaldbradford.com/blog/2026-06-16-using-genai-directly-with-mysql-8/</guid>
    <link>https://ronaldbradford.com/blog/2026-06-16-using-genai-directly-with-mysql-8/</link>
    <description>If you have a typical MySQL production setup using MySQL 8.0 (EOL) with replication, you can take advantage of VillageSQL extensions to generate AI responses directly with your source data with no impact on your production setup or existing application software.</description>
    <pubDate>Tue, 16 Jun 2026 00:00:00 +0000</pubDate>
    <dc:creator>Ronald Bradford</dc:creator>
  </item>

  <item>
    <title>The Failover Brownout: Rethinking High Availability in MySQL Group Replication</title>
    <guid isPermaLink="false">https://www.percona.com/?p=49852</guid>
    <link>https://www.percona.com/blog/the-failover-brownout-rethinking-high-availability-in-mysql-group-replication/</link>
    <description>It is time to talk again about Flow control and group replication. This time with a special eye on the use of Group Replication in the Kubernetes context. In this article we will dig a bit on how it works and what are the various side effects. 
 
The problem
Recently I was refining the calculation I use in the MySQL calculator for Operator given I was constantly encountering a very serious problem with the Percona Server Operator.
The problem is that when the deployment was/is serving a high level of traffic, it will, no matter what, end up in getting OMMKill by the K8 system. 
This because the pod was gradually consuming more and more memory, reaching the memory limit set in the CR specification. 
 
Now let me clarify a few things, to get straight to the facts.
Kubernetes itself does not OOMKill a pod for hitting its memory limit, the mechanism works as described below with mention on how Working Set Size (WSS) is calculated, and how OOMKills are triggered, and in the resource sections, the links to the official documentation and source code.
 
1. The Reality of OOMKills vs. Kubelet Evictions
It is crucial to distinguish between what the Linux kernel does and what Kubernetes does:

OOMKilled (Exit Code 137): This is executed entirely by the Linux kernel’s OOM Killer, not Kubernetes. When we set a memory limit in our Pod spec, Kubernetes translates that into a Linux cgroup constraint (memory.limit_in_bytes for cgroups v1, or memory.max for cgroups v2). If our container attempts to allocate more memory than this hard limit, and the kernel cannot reclaim any page cache (like inactive files), the kernel directly intervenes and terminates the process.
Node-Pressure Evictions: This is where Kubernetes actively observes memory. The kubelet monitors the working_set_bytes metric to protect the node from running out of memory. If the node’s memory drops below an eviction threshold, Kubernetes will actively evict pods to prevent the kernel from initiating a system-wide OOM kill.

2. How Working Set Size (WSS) is Calculated for the container
Kubernetes monitors container memory via cAdvisor, which is integrated directly into the kubelet. cAdvisor calculates the Working Set Size by taking the total memory usage and subtracting the inactive file cache (memory that the kernel can easily reclaim if it faces memory pressure).
Because active file caches and anonymous memory (like our application’s heap) cannot be easily evicted, this working set metric is the most accurate representation of the memory your container is forcing the system to hold.
 
The Calculation &amp;amp; cgroups Evolution The core mathematical calculation is Memory Usage – Inactive File Cache, but how cAdvisor fetches this data from the Linux kernel depends entirely on your node’s cgroup version. Modern cAdvisor relies heavily on the opencontainers/runc/libcontainer library to read these raw cgroup files:

cgroups v1: cAdvisor starts with the raw usage from memory.usage_in_bytes and subtracts the reclaimable cache found under the total_inactive_file key.
cgroups v2 (Unified): cAdvisor starts with the raw usage from memory.current and subtracts the reclaimable cache found under the inactive_file key.

 
The Underlying Code Logic While older versions used a static setMemoryStats function, modern Kubernetes branches handle this dynamically. The logic executes the following flow before reporting back to the kubelet:

Detects Version: It identifies whether the node runs cgroups v1 or v2 to determine the correct inactive file key name.
Fetch Usage: It pulls the raw memory usage from the container.
Subtract Cache: It looks up the inactive file value and safely subtracts it from the usage (including a safeguard to ensure the working set never drops below zero).
Report Metric: It sets this final calculated value as container_memory_working_set_bytes, which the kubelet then uses to decide if the node is under memory pressure.

Back to us 
At the end the point is that if our pod reaches the limit and we ARE NOT using the new swap feature existing in Kubernetes, our pod will be brutally killed, and in 99% of the cases our production will suffer a lot. !Ops spoiler!
 
To clearly understand what was causing the issue about this memory consumption and having my calculator fail, I started to collect the information about the memory usage in MySQL itself.
 
SELECT EVENT_NAME,CURRENT_NUMBER_OF_BYTES_USED / 1024 / 1024 AS current_usage_mb FROM performance_schema.memory_summary_global_by_event_name WHERE EVENT_NAME like ‘memory/%’ and EVENT_NAME not like ‘memory/performance%’  order by current_usage_mb desc limit 25;
Which will give you and output like this:+---------------------------------------+------------------+
| EVENT_NAME                            | current_usage_mb |
+---------------------------------------+------------------+
| memory/innodb/buf_buf_pool            |   46398.92578125 |
| memory/group_rpl/GCS_XCom::xcom_cache |    1066.66179943 |
| memory/group_rpl/certification_info   |      92.45250702 |
| memory/innodb/log_buffer_memory       |      64.00096130 |
| memory/sql/TABLE                      |      49.90627003 |
| memory/innodb/memory                  |      34.68734741 |
| memory/innodb/ut0link_buf             |      24.00006104 |
| memory/innodb/lock0lock               |      21.40064240 |
| memory/mysqld_openssl/openssl_malloc  |       9.51009655 |
| memory/innodb/read0read               |       8.19496155 |
| memory/mysys/KEY_CACHE                |       8.00215149 |
| memory/innodb/sync0arr                |       7.03147125 |
| memory/innodb/ha_innodb               |       6.87006950 |
| memory/innodb/lock_sys                |       5.25009155 |
| memory/sql/log_sink_pfs               |       5.00003052 |
| memory/innodb/ut0pool                 |       4.00017548 |
| memory/sql/dd::objects                |       2.83031464 |
| memory/innodb/std                     |       2.72618866 |
| memory/innodb/os0file                 |       2.63054657 |
| memory/innodb/os0event                |       2.34302521 |
| memory/sql/TABLE_SHARE::mem_root      |       2.31734467 |
| memory/innodb/trx0trx                 |       2.22647858 |
| memory/temptable/physical_ram         |       1.00003052 |
| memory/sql/dd::String_type            |       0.94942093 |
| memory/innodb/btr0pcur                |       0.89743423 |
+---------------------------------------+------------------+ 
Plus I used PMM to collect memory information 

To simulate the load I used the sysbench-tpcc (tpc-c derivate test) variant and run the tests simulating a load of 1024 threads against a cluster based on machine with 16 Core and 64Gb volumes ~3k IOPS, so not gigantic but not small. 
 
The finding was almost immediate:+---------------------------------------+------------------+
| EVENT_NAME                            | current_usage_mb |
+---------------------------------------+------------------+
| memory/innodb/buf_buf_pool            |   46398.92578125 |
| memory/group_rpl/certification_info   |    1431.67934418 | &amp;lt;constantly increasing
| memory/group_rpl/GCS_XCom::xcom_cache |    1066.63542366 |
| memory/sql/Gtid_set::Interval_chunk   |      95.52413940 |
| memory/innodb/log_buffer_memory       |      64.00096130 |
| memory/sql/TABLE                      |      48.17613125 |
| memory/innodb/memory                  |      35.08897400 |
| memory/innodb/ut0link_buf             |      24.00006104 |
| memory/innodb/lock0lock               |      21.40064240 |
| memory/innodb/read0read               |      14.86782837 |
| memory/mysqld_openssl/openssl_malloc  |      12.05916119 |
| memory/mysys/KEY_CACHE                |       8.00215149 |
| memory/innodb/sync0arr                |       7.03147125 |
| memory/innodb/ha_innodb               |       6.84074974 |
| memory/innodb/lock_sys                |       5.25009155 |
| memory/sql/log_sink_pfs               |       5.00003052 |
| memory/innodb/ut0pool                 |       4.00017548 |
| memory/sql/dd::objects                |       2.82012177 |
| memory/innodb/std                     |       2.72515869 |
| memory/innodb/os0file                 |       2.63054657 |
| memory/innodb/os0event                |       2.35884857 |
| memory/innodb/trx0trx                 |       2.22647858 |
| memory/sql/TABLE_SHARE::mem_root      |       1.83777618 |
| memory/innodb/trx0undo                |       1.26304626 |
| memory/mysys/lf_node                  |       1.08828735 |
+---------------------------------------+------------------+ 

Ok then … What is the certification info???
What is group_rpl/certification_info?
In MySQL, memory/group_rpl/certification_info is a Performance Schema memory instrument. It tracks the exact amount of RAM allocated to store the Certification Database (or Certification Info).
In Group Replication, nodes do not lock rows across the network while a transaction is executing. Instead, transactions execute locally and optimistically. When it is time to commit, the transaction undergoes a Certification Process to ensure no other concurrent transaction in the cluster has modified the exact same rows. The certification_info buffer is the in-memory hash map that makes this conflict detection possible.
1. What is it used for?
The certification_info structure acts as a tracking ledger for recently modified rows.
Here is how it works under the hood:

The Key-Value Pair: It is fundamentally an in-memory dictionary. The key is the hash of a modified row (extracted from the transaction’s “write set”), and the value is the Global Transaction Identifier (GTID) of the transaction that successfully modified it.
Conflict Detection: When a new transaction attempts to commit, it broadcasts its write set and the “snapshot version” of the database it saw when it started. The certifier cross-references the incoming transaction’s write set against the certification_info map.
The Decision: If the certification_info shows that a row was modified by a newer GTID that the incoming transaction did not “see” when it started, a conflict is flagged, and the transaction is aborted. If no conflict exists, the transaction is certified, and the certification_info map is updated with the new write set and GTID.

The primary does not hold onto this memory out of stubbornness; it does so because purging that data too early would destroy the cluster’s consistency in the event of a failover.
 
In Group Replication, garbage collection for the certification_info buffer is not triggered just because a transaction commits on the primary. It is triggered by a concept called the Stable Set. 
Every node in the cluster periodically broadcasts a message to the rest of the group saying, “Here are the GTIDs I have successfully applied to my disk.” The cluster then calculates a global low watermark. This watermark is the highest transaction GTID that every single member of the group has successfully applied. Garbage collection is only allowed to purge write-sets from the certification database that fall below this global watermark. 
To note that this purge is a synchronous operation during which writes are forbidden.
2. How the Apply Queue Stalls the Watermark
When a secondary node starts lagging, its applier queue grows. This means the secondary is receiving transactions from the network quickly, but its SQL thread is too slow to actually execute them and commit them to disk.
Because the secondary hasn’t applied these transactions, it cannot report those GTIDs back to the group as “finished.”

The lagging secondary’s local watermark stalls.
Therefore, the global low watermark for the entire cluster stalls.
Because the global watermark hasn’t moved forward, the garbage_collect function on the primary (and all other nodes) says, “I am not allowed to delete any write-sets yet.”
As the primary continues to process new writes, the certification_info memory buffer grows continuously.

3. Why the Primary Cannot Purge Early
we might wonder: If the transaction is already committed on the primary, why does the primary care if the secondary has applied it? Why not just drop the write-set from its own memory?
The answer comes down to Failover Safety and Distributed Conflict Detection. GR is a shared-nothing, decentralized architecture. Even if you are running in Single-Primary  mode (keep this in mind will be important later), the underlying engine uses the exact same logic as Multi-Primary mode. 
Here is why the primary is forbidden from purging that data:

The Failover Scenario: Imagine our primary node crashes right now. The lagging secondary (which still has a massive apply queue) is immediately elected as the new primary.
The Conflict Risk: As the new primary, it starts accepting new writes from your application. However, it still has thousands of old transactions in its applier queue that it hasn’t written to disk yet!
The Necessity of the Buffer: When a new write comes in, the new primary must check if that write conflicts with any of the pending transactions in its apply queue. It does this by checking the certification_info map. If the old primary had purged the global certification data early, the new primary wouldn’t have the write-sets for those pending transactions. It would blindly accept the new write, causing a massive data conflict and breaking the replication group entirely.

Fine Marco, then what is the effect of this?
 
Well, drums roll …
… When a secondary node is elected as the new primary during a failover, it does not immediately open the floodgates to new writes. It keeps its super_read_only variable set to ON until it has completely drained its local apply queue of all transactions that were certified prior to the election.
This is an intentional design choice to guarantee that the new primary’s state is completely consistent with the old primary before it starts accepting new data.
 
4. Immediate Write Rejections (No Built-in Queuing)
The most critical impact to understand is that the new primary does not queue or pause new incoming writes while it catches up. It outright rejects them.
If our application or proxy routes a COMMIT, INSERT, UPDATE, or DELETE to the new primary while it is still processing the old queue, MySQL will immediately throw an error back to the client:
ERROR 1290 (HY000): The MySQL server is running with the –super-read-only option so it cannot execute this statement
5. The “Brownout” Window (Write Outage)
Because of this behavior, a failover in MySQL Group Replication does not instantly restore write availability. Our cluster experiences a “brownout”, a period where reads might succeed, but writes are entirely blocked.
The duration of this write outage is directly proportional to the size of the apply queue.

If the secondary was fully caught up, write availability is restored in milliseconds.
If the secondary was lagging by 50 minutes, your application will suffer a 50 minute write outage while the node applies the backlog.

6. Impact on Proxies (e.g., MySQL Router or ProxySQL)
If we are using a proxy layer to route your database traffic, the apply queue dictates how the proxy behaves during the transition:

MySQL Router: It continuously monitors the cluster topology and the super_read_only flag. Even though the node has technically been elected primary, Router will not open the read-write port to it until the apply queue drains and super_read_only flips to OFF. Depending on your application timeouts, client connections will either hang waiting for a writable connection or fail completely.
ProxySQL: Similar to Router, if it is configured to check for the read_only state, it will temporarily quarantine the new primary from the write hostgroup.
HAProxy (in Operator): Monitor both Primary state and read_only state, but it expose the Primary to writes causing the application to fail (bug we need to fix)  

7. Read Traffic and Stale Data
During this catch-up phase, the node will accept incoming SELECT queries (since it is still a valid database). However, because it is actively churning through the old primary’s backlog, the data being read is temporarily stale.
If your application reads a row that is sitting in the apply queue but hasn’t been committed to disk yet, it will get the old version of that row.
Why Flow Control is Critical
Because a large apply queue turns a seamless failover into a severe, application-breaking write outage, Group Replication includes the Flow Control feature.
Flow Control monitors the size of the apply queues across all secondaries. If a secondary starts lagging too far behind, Flow Control should actively throttle the write throughput on the current primary to allow the lagging node to catch up. It is essentially a trade-off: we accept a slight performance hit during normal operations to guarantee that your database recovers almost instantly during a failover.
However, this is not what really happens.
1. It is Reactive, Not Proactive (The Polling Blind Spot)
Flow control does not intercept and evaluate every single transaction in real-time. Instead, it relies on a periodic polling interval governed by group_replication_flow_control_period (which defaults to 1 second).
Once a second, the cluster checks the size of the apply queues and the certifier queues.

The Vulnerability: If our application generates a massive spike of 50,000 writes in 500 milliseconds, the primary will happily accept and certify all of them. Flow control will not even notice the spike until the next 1 second polling interval hits. By the time it decides to apply a throttle, the damage is already done, and the secondary’s queue is already overflowing.

2. The PID Controller’s “Soft Brake” Math
When flow control does decide to throttle, it does not simply freeze the primary. It uses a PID (Proportional-Integral-Derivative) controller algorithm to calculate a “write quota” (the maximum number of transactions the primary is allowed to commit in the next second).
The PID controller is deliberately tuned to be gentle. It wants to gracefully degrade performance rather than cause immediate application timeouts.

When the secondary’s queue breaches the group_replication_flow_control_applier_threshold (default 25,000 transactions), the PID controller reduces the primary’s quota incrementally.
The Failure Point: If the primary’s incoming write rate is astronomically higher than the secondary’s disk IO capacity, this incremental “step down” in the quota is too slow. The primary is still allowed to write, say, 10,000 transactions per second, while the secondary is only applying 2,000. The queue continues to grow aggressively despite the throttle being “active.”

3. The Concurrency Mismatch (Parallel vs. Serial)
This is often the silent killer that defeats flow control. Flow control makes mathematical assumptions about how fast the secondary should be able to apply transactions based on recent history.
However, the primary node might be executing writes using hundreds of highly concurrent threads. The secondary relies on the parallel applier to keep up. If the incoming workload suddenly includes transactions that cannot be parallelized, such as writes hitting overlapping rows, cascading foreign key updates, or DDL statements, the secondary’s applier instantly drops from executing in parallel down to a single, serialized thread.
When this serialization happens, the secondary’s applier rate plummets instantly. Flow control, which only checks in once a second and adjusts gradually, cannot brake the primary fast enough to compensate for the secondary suddenly dropping to a crawl.
What can we do?
At the moment of writing there are only two things that can be done.

Make Flow control more aggressive
Increase the number of replication appliers

 
1. Making Flow Control More Aggressive
We can configure Flow Control to be a bit more aggressive. It will still remain a suggestion but a strong one.
How it works (The Configuration):

Lower the Threshold: By reducing group_replication_flow_control_applier_threshold (default is 25,000) to something like 1,000 or 500, we force the PID controller to kick in almost immediately when a spike occurs.
Remove the Safety Net: By keeping  group_replication_flow_control_min_quota to 0 (default), we remove the minimum write guarantee. If the secondary falls behind, Flow Control is allowed to throttle the primary’s writes down to zero, also if this will never happen.
Increase the Sensitivity: We can tweak the PID controller’s math (using the derivative and proportional tuning variables) to react much more aggressively to queue growth.
       group_replication_flow_control_hold_percent=100
       group_replication_flow_control_release_percent=5

 
The reality check, does it work?:
If the expectation is to have a rigid control over the applier queue on the lagging secondary, then the answer is NO. No matter what, at the moment flow control is not designed to act as we are used to in PXC (Percona Xtradb Cluster), where we have a rigid control of the pending queue also at the cost of delaying the writes. In Group Replication  the Flow Control will never bring the write to 0, the unfortunate aspect is that the mechanism is not enough to keep the queue under control.
 
2. Increasing Replication Appliers 
To help the secondary chew through the queue faster, we can increase the number of parallel threads it uses to write to disk.
How it works: We can increase the replica_parallel_workers (formerly slave_parallel_workers) setting. GR is exceptionally smart about this. Because of the certification process we discussed earlier, GR already knows exactly which transactions modify which rows. It uses a writeset-based dependency tracker to safely hand off non-conflicting transactions to multiple worker threads simultaneously.
The formula that is normally used to calculate the number of replication workers is to set 2.5 workers for each available core. IE if we have 14000m CPUs in our CR (K8) then we can assign ~35 workers, this is definitely higher than the default value of 4.   
The reality check, does it work?:  Yes, but only if our workload allows it.

The Catch – The Serialization Wall: Parallel appliers only work if the transactions do not conflict. If our application has 50 concurrent threads all trying to update the same “inventory count” row, or updating a highly contentious table, those transactions cannot be parallelized. The secondary’s coordinator thread will see the row-level conflicts and force those transactions to wait in line and execute sequentially. We could allocate 128 parallel workers, but 127 of them will sit idle while one thread does all the work.
The Catch – Context Switching: More threads do not magically create more disk IOPS. If we set the workers too high (e.g., beyond the physical CPU core count or disk IO capacity), the secondary’s InnoDB engine will spend more time context-switching and fighting over internal mutex locks than actually committing data. In many cases, over-allocating parallel workers actually slows down the apply rate.

Do we have any conclusions?
1. If HA is the goal, enforce Strict Flow Control
If our absolute top priority is High Availability, specifically achieving a near-zero Recovery Time Objective (RTO), we must configure an aggressive flow control.

The Logic: Fast failovers require small apply queues. To guarantee a small apply queue, we must strictly throttle the primary the millisecond the secondary starts to lag.
The Trade-off: we are protecting the cluster’s failover readiness at the expense of application write latency. If there is a massive write spike, our application will face timeouts and connection errors, but if the primary server suddenly catches fire, our database will recover and elect a new primary almost instantly.

The problem is that Group Replication is not able to act like that today, this is something we eventually need to implement to have better HA.
2. If Performance is the goal, relax Flow Control
If our top priority is keeping the application fast and ensuring COMMIT latencies remain extremely low, we should relax flow control or rely on the generous defaults.

The Logic: By relaxing flow control, we allow the primary to run at the absolute maximum speed its local disks and CPU allow. It does not care if the secondaries fall behind. Our application users remain happy and experience zero throttling.
The Trade-off: We are accepting severe risks to your HA posture. If the primary crashes while the secondaries have a massive apply queue, we will suffer a long write outage (the brownout) while the new primary catches up. Additionally, we are accepting the risk that the certification_info memory buffer will grow significantly on the primary and eventually have the pod OOMKilled .

3. Is this not what Asynchronous replication with semy-sync offers?
 
1. The Similarities
If we look purely at how a single transaction flows and how a failover behaves, GR and Semi-Sync look like twins:

The Durability Guarantee: Semi-Sync: The primary waits to commit until at least one secondary confirms it has received the transaction and written it to its local Relay Log. 

GR: The primary waits to commit until a majority quorum of nodes confirm they have received the transaction, certified it, and written it to their local relay logs.


The Failover Delay (The Queue):  In both systems, the secondary receiving the data does not mean the secondary has applied the data to its InnoDB tables.

If a crash happens, both systems require the new primary to completely execute its pending queue (Relay Log for Semi-Sync, Apply Queue for GR) before it is safe to accept new writes.



2. The Crucial Differences
If they behave so similarly, why use GR at all? 
The differences lie entirely in automation, consensus, and split-brain protection. Semi-Sync is just a data transport mechanism; GR is a full state-machine cluster.
Here is what GR gives you that Semi-Sync does not:

Automatic Election and Orchestration:

Semi-Sync: If the primary dies, Semi-Sync does nothing. The cluster sits there broken. You must rely on external tools (like Orchestrator or manual DBA intervention) to detect the crash, pick the most up-to-date secondary, wait for its relay log to apply, disable read_only, and re-point the application.
GR: The cluster detects the failure natively. The remaining nodes use Paxos consensus to elect a new primary automatically, manage the queue drain natively via the super_read_only flip we discussed, and self-heal.


Split-Brain Protection (Network Partitions):

Semi-Sync: If our network splits in half, an external failover tool might accidentally promote a secondary while the old primary is still alive and accepting writes. We now have a split-brain, and our data is permanently corrupted.
GR: GR enforces strict quorum. If a network split happens, the side of the network with the minority of nodes will automatically fence itself off and refuse all writes. Split-brain is mathematically prevented.


The Certification Database:

As we established, GR requires the certification map to ensure the new primary doesn’t accept writes that conflict with its unapplied queue. Semi-Sync does not have this; it relies entirely on the external failover tool to guarantee no writes touch the new primary until the relay log is 100% applied.



3. Final observation
If we are using Single-Primary GR with relaxed flow control, we have essentially built a highly-automated, consensus-driven version of Semi-Sync replication. 
We have the exact same apply-queue bottleneck during failover, but we have traded the need for external orchestrator tools for built-in Paxos consensus and native split-brain protection.
 
Conclusions (for real)
When we run MySQL on a traditional, dedicated Virtual Machine, memory limits are “soft.” If the certification_info database explodes and consumes an extra 10GB of RAM because of the applier lag, the Linux OS might start aggressively swapping inactive pages to disk, but the MySQL process usually survives. Performance degrades, but the database stays online.
In Kubernetes, memory limits are “hard.” As we discussed earlier, Kubernetes enforces pod memory limits via cgroups v2 (memory.max). The Linux kernel’s OOM Killer has no understanding of database quorum, failover states, or apply queues. It only sees math: Working Set Size &amp;gt; memory.max = Terminate Process (Exit Code 137).
The Chain Reaction of Relaxed Flow Control in k8s
If we prioritize “performance” by relaxing Flow Control in a Kubernetes environment, we are essentially setting a ticking time bomb. Here is the chain of events:

The Spike: Our application experiences a massive write spike.
The Queue: The secondary pod’s disk cannot keep up, and its applier queue grows to 1,000,000 transactions.
The Memory Sprawl: Because the queue is large, the global low-watermark stalls. The Primary pod is forbidden from garbage collecting the certification_info map. The in-memory hash map balloons in size.
The Execution: The memory.current metric will reach the memory.max, kernel will trigger the OMMKill process. First action will be to try to free the page.cache related to the process. If the purge is successful and the memory.current is less than memory.max then the process will persist, otherwise the kernel will kill it. 
We can use the WSS metric to predict a successful OMMKill.
 The Primary pod’s Working Set Size (WSS) breaches its Kubernetes memory limit, this is a fair estimate not an absolute value.
The Catastrophe: The Linux OOM Killer instantly assassinates the Primary MySQL process.

Because we tried to avoid a few seconds of write latency by keeping relaxed Flow Control, we inadvertently caused a hard crash of the primary database pod, with long write downtime.
The Architectural Law
Therefore, here is my statement as architectural law for containerized environments: In Kubernetes, High Availability and Pod stability are so intrinsically linked that Flow Control must act as hard as it can to cap the apply queue.

We cannot allow unbounded memory growth in a container. The only way to bound certification_info memory is to bound the apply queue.
The only way to bound the apply queue is with strict, aggressive Flow Control.
Increasing the number of replication appliers helps but is not the conclusive answer.

In a Kubernetes environment, we must tune group_replication_flow_control_applier_threshold to a strict, low number, and accept that during massive traffic spikes, our application will experience write throttling. It is infinitely better for our application’s connection pool to wait 2 seconds for a COMMIT to succeed than for the primary database pod to be violently OOMKilled by the kernel, and have to wait for minutes or hours to recover write capabilities.
Note
Just as a mention this is exactly how Percona Operator with Percona Xtradb Cluster works. To be more specific, PXC and in general solutions based on Galera have a Flow Control mechanism that enforces the queue to be inside hard limits. While this more invasive control may be noticeable at application level, it guarantees that the other nodes are not lagging behind the primary and this is why it is a stronger HA solution in the Kubernetes environment.
 
Reference
https://github.com/Tusamarco/mysqloperatorcalculator
Managing Resources and OOMKills: Resource Management for Pods and Containers (This page details how memory limits are enforced reactively by the Linux kernel via OOM kills).
How WSS triggers Evictions: Node-pressure Eviction (This page explicitly details how the kubelet uses the memory.available signal, which is derived from node capacity minus the working set size).
Latest changes. Pointer to the code 
Swap Memory Management (Core Concepts &amp;amp; Configuration): https://kubernetes.io/docs/concepts/cluster-administration/swap-memory-management/
The post The Failover Brownout: Rethinking High Availability in MySQL Group Replication appeared first on Percona.</description>
    <content:encoded><![CDATA[<p><span>It is time to talk again about Flow control and group replication. This time with a special eye on the use of Group Replication in the Kubernetes context. In this article we will dig a bit on how it works and what are the various side effects. </span></p>
<p> </p>
<h2><span>The problem</span></h2>
<p><span>Recently I was refining the calculation I use in the </span><a href="https://github.com/Tusamarco/mysqloperatorcalculator"><span>MySQL calculator for Operator</span></a><span> given I was constantly encountering a very serious problem with the Percona Server Operator.</span></p>
<p><span>The problem is that when the deployment was/is serving a high level of traffic, it will, no matter what, end up in getting OMMKill by the K8 system. </span></p>
<p><span>This because the pod was gradually consuming more and more memory, reaching the memory limit set in the CR specification. </span></p>
<p> </p>
<p><span>Now let me clarify a few things, to get straight to the facts.</span></p>
<p><span>Kubernetes itself does not OOMKill a pod for hitting its memory limit, the mechanism works as described below with mention on how Working Set Size (WSS) is calculated, and how OOMKills are triggered, and in the resource sections, the links to the official documentation and source code.</span></p>
<p> </p>
<h3><span>1. The Reality of OOMKills vs. Kubelet Evictions</span></h3>
<p><span>It is crucial to distinguish between what the Linux kernel does and what Kubernetes does:</span></p>
<ul>
<li aria-level="1"><b>OOMKilled (Exit Code 137):</b><span> This is executed entirely by the </span><b>Linux kernel’s OOM Killer</b><span>, not Kubernetes. When we set a memory limit in our Pod spec, Kubernetes translates that into a Linux cgroup constraint (</span><span>memory.limit_in_bytes</span><span> for cgroups v1, or </span><span>memory.max</span><span> for cgroups v2). If our container attempts to allocate more memory than this hard limit, and the kernel cannot reclaim any page cache (like inactive files), the kernel directly intervenes and terminates the process.</span></li>
<li aria-level="1"><b>Node-Pressure Evictions:</b><span> This is where Kubernetes actively observes memory. The </span><span>kubelet</span><span> monitors the </span><span>working_set_bytes</span><span> metric to protect the </span><i><span>node</span></i><span> from running out of memory. If the node’s memory drops below an eviction threshold, Kubernetes will actively evict pods to prevent the kernel from initiating a system-wide OOM kill.</span></li>
</ul>
<h3><span>2. How Working Set Size (WSS) is Calculated for the container</span></h3>
<p><span>Kubernetes monitors container memory via </span><b>cAdvisor</b><span>, which is integrated directly into the </span><span>kubelet</span><span>. cAdvisor calculates the Working Set Size by taking the total memory usage and subtracting the inactive file cache (memory that the kernel can easily reclaim if it faces memory pressure).</span></p>
<p><span>Because active file caches and anonymous memory (like our application’s heap) cannot be easily evicted, this working set metric is the most accurate representation of the memory your container is forcing the system to hold.</span></p>
<p> </p>
<p><span>The Calculation &amp; cgroups Evolution The core mathematical calculation is </span><i><span>Memory Usage</span></i><i><span> – </span></i><i><span>Inactive File Cache</span></i><span>, but </span><i><span>how</span></i><span> cAdvisor fetches this data from the Linux kernel depends entirely on your node’s cgroup version. Modern cAdvisor relies heavily on the </span><span>opencontainers/runc/libcontainer</span><span> library to read these raw cgroup files:</span></p>
<ul>
<li aria-level="1"><span>cgroups v1: cAdvisor starts with the raw usage from </span><span>memory.usage_in_bytes</span><span> and subtracts the reclaimable cache found under the </span><span>total_inactive_file</span><span> key.</span></li>
<li aria-level="1"><span>cgroups v2 (Unified): cAdvisor starts with the raw usage from </span><span>memory.current</span><span> and subtracts the reclaimable cache found under the </span><span>inactive_file</span><span> key.</span></li>
</ul>
<p> </p>
<p><span>The Underlying Code Logic While older versions used a static </span><span>setMemoryStats</span><span> function, modern Kubernetes branches handle this dynamically. The logic executes the following flow before reporting back to the </span><span>kubelet</span><span>:</span></p>
<ol>
<li aria-level="1"><span>Detects Version: It identifies whether the node runs cgroups v1 or v2 to determine the correct inactive file key name.</span></li>
<li aria-level="1"><span>Fetch Usage: It pulls the raw memory usage from the container.</span></li>
<li aria-level="1"><span>Subtract Cache: It looks up the inactive file value and safely subtracts it from the usage (including a safeguard to ensure the working set never drops below zero).</span></li>
<li aria-level="1"><span>Report Metric: It sets this final calculated value as </span><span>container_memory_working_set_bytes</span><span>, which the </span><span>kubelet</span><span> then uses to decide if the node is under memory pressure.</span></li>
</ol>
<h2><span>Back to us </span></h2>
<p><span>At the end the point is that if our pod reaches the limit and we ARE NOT using the new </span><a href="https://docs.google.com/document/d/1WSoJxaAPMP4tdiT_-U4YwgoHBI8zKJ0Hw-y6iUXJQBc/edit#bookmark=id.59zqn83hsx1p"><span>swap feature</span></a><span> existing in Kubernetes, our pod will be brutally killed, and in 99% of the cases our production will suffer a lot. !Ops spoiler!</span></p>
<p> </p>
<p><span>To clearly understand what was causing the issue about this memory consumption and having my calculator fail, I started to collect the information about the memory usage in MySQL itself.</span></p>
<p> </p>
<p><span>SELECT EVENT_NAME,CURRENT_NUMBER_OF_BYTES_USED / 1024 / 1024 AS current_usage_mb FROM performance_schema.memory_summary_global_by_event_name WHERE EVENT_NAME like ‘memory/%’ and EVENT_NAME not like ‘memory/performance%’  order by current_usage_mb desc limit 25;</span></p>
<p><span>Which will give you and output like this:</span></p><pre class="urvanov-syntax-highlighter-plain-tag">+---------------------------------------+------------------+
| EVENT_NAME                            | current_usage_mb |
+---------------------------------------+------------------+
| memory/innodb/buf_buf_pool            |   46398.92578125 |
| memory/group_rpl/GCS_XCom::xcom_cache |    1066.66179943 |
| memory/group_rpl/certification_info   |      92.45250702 |
| memory/innodb/log_buffer_memory       |      64.00096130 |
| memory/sql/TABLE                      |      49.90627003 |
| memory/innodb/memory                  |      34.68734741 |
| memory/innodb/ut0link_buf             |      24.00006104 |
| memory/innodb/lock0lock               |      21.40064240 |
| memory/mysqld_openssl/openssl_malloc  |       9.51009655 |
| memory/innodb/read0read               |       8.19496155 |
| memory/mysys/KEY_CACHE                |       8.00215149 |
| memory/innodb/sync0arr                |       7.03147125 |
| memory/innodb/ha_innodb               |       6.87006950 |
| memory/innodb/lock_sys                |       5.25009155 |
| memory/sql/log_sink_pfs               |       5.00003052 |
| memory/innodb/ut0pool                 |       4.00017548 |
| memory/sql/dd::objects                |       2.83031464 |
| memory/innodb/std                     |       2.72618866 |
| memory/innodb/os0file                 |       2.63054657 |
| memory/innodb/os0event                |       2.34302521 |
| memory/sql/TABLE_SHARE::mem_root      |       2.31734467 |
| memory/innodb/trx0trx                 |       2.22647858 |
| memory/temptable/physical_ram         |       1.00003052 |
| memory/sql/dd::String_type            |       0.94942093 |
| memory/innodb/btr0pcur                |       0.89743423 |
+---------------------------------------+------------------+</pre><p> </p>
<p><span>Plus I used PMM to collect memory information </span></p>
<p><img fetchpriority="high" decoding="async" class="alignnone wp-image-49857 size-medium_large" src="https://www.percona.com/wp-content/uploads/2026/06/allocation_with_incidents_describe-768x395.jpg" alt="" width="768" height="395" srcset="https://www.percona.com/wp-content/uploads/2026/06/allocation_with_incidents_describe-768x395.jpg 768w, https://www.percona.com/wp-content/uploads/2026/06/allocation_with_incidents_describe-300x154.jpg 300w, https://www.percona.com/wp-content/uploads/2026/06/allocation_with_incidents_describe-1024x526.jpg 1024w, https://www.percona.com/wp-content/uploads/2026/06/allocation_with_incidents_describe.jpg 1493w" sizes="(max-width: 768px) 100vw, 768px"></p>
<p><span>To simulate the load I used the sysbench-tpcc (tpc-c derivate test) variant and run the tests simulating a load of 1024 threads against a cluster based on machine with 16 Core and 64Gb volumes ~3k IOPS, so not gigantic but not small. </span></p>
<p> </p>
<p><span>The finding was almost immediate:</span></p><pre class="urvanov-syntax-highlighter-plain-tag">+---------------------------------------+------------------+
| EVENT_NAME                            | current_usage_mb |
+---------------------------------------+------------------+
| memory/innodb/buf_buf_pool            |   46398.92578125 |
| memory/group_rpl/certification_info   |    1431.67934418 | &lt;constantly increasing
| memory/group_rpl/GCS_XCom::xcom_cache |    1066.63542366 |
| memory/sql/Gtid_set::Interval_chunk   |      95.52413940 |
| memory/innodb/log_buffer_memory       |      64.00096130 |
| memory/sql/TABLE                      |      48.17613125 |
| memory/innodb/memory                  |      35.08897400 |
| memory/innodb/ut0link_buf             |      24.00006104 |
| memory/innodb/lock0lock               |      21.40064240 |
| memory/innodb/read0read               |      14.86782837 |
| memory/mysqld_openssl/openssl_malloc  |      12.05916119 |
| memory/mysys/KEY_CACHE                |       8.00215149 |
| memory/innodb/sync0arr                |       7.03147125 |
| memory/innodb/ha_innodb               |       6.84074974 |
| memory/innodb/lock_sys                |       5.25009155 |
| memory/sql/log_sink_pfs               |       5.00003052 |
| memory/innodb/ut0pool                 |       4.00017548 |
| memory/sql/dd::objects                |       2.82012177 |
| memory/innodb/std                     |       2.72515869 |
| memory/innodb/os0file                 |       2.63054657 |
| memory/innodb/os0event                |       2.35884857 |
| memory/innodb/trx0trx                 |       2.22647858 |
| memory/sql/TABLE_SHARE::mem_root      |       1.83777618 |
| memory/innodb/trx0undo                |       1.26304626 |
| memory/mysys/lf_node                  |       1.08828735 |
+---------------------------------------+------------------+</pre><p> </p>
<p><span><br>
</span><span>Ok then … What is the certification info???</span></p>
<h2><span>What is group_rpl/certification_info?</span></h2>
<p><span>In MySQL, </span><span>memory/group_rpl/certification_info</span><span> is a Performance Schema memory instrument. It tracks the exact amount of RAM allocated to store the Certification Database (or Certification Info).</span></p>
<p><span>In Group Replication, nodes do not lock rows across the network while a transaction is executing. Instead, transactions execute locally and optimistically. When it is time to commit, the transaction undergoes a </span><i><span>Certification Process</span></i><span> to ensure no other concurrent transaction in the cluster has modified the exact same rows. The </span><span>certification_info</span><span> buffer is the in-memory hash map that makes this conflict detection possible.</span></p>
<h3><span>1. What is it used for?</span></h3>
<p><span>The </span><span>certification_info</span><span> structure acts as a tracking ledger for recently modified rows.</span></p>
<p><span>Here is how it works under the hood:</span></p>
<ul>
<li aria-level="1"><span>The Key-Value Pair: It is fundamentally an in-memory dictionary. The </span><i><span>key</span></i><span> is the hash of a modified row (extracted from the transaction’s “write set”), and the </span><i><span>value</span></i><span> is the Global Transaction Identifier (GTID) of the transaction that successfully modified it.</span></li>
<li aria-level="1"><span>Conflict Detection: When a new transaction attempts to commit, it broadcasts its write set and the “snapshot version” of the database it saw when it started. The certifier cross-references the incoming transaction’s write set against the </span><span>certification_info</span><span> map.</span></li>
<li aria-level="1"><span>The Decision: If the </span><span>certification_info</span><span> shows that a row was modified by a newer GTID that the incoming transaction did not “see” when it started, a conflict is flagged, and the transaction is aborted. If no conflict exists, the transaction is certified, and the </span><span>certification_info</span><span> map is updated with the new write set and GTID.</span></li>
</ul>
<p><span>The primary does not hold onto this memory out of stubbornness; it does so because purging that data too early would destroy the cluster’s consistency in the event of a failover.</span></p>
<p> </p>
<p><span>In Group Replication, garbage collection for the </span><span>certification_info</span><span> buffer is not triggered just because a transaction commits on the primary. It is triggered by a concept called the Stable Set. </span></p>
<p><span>Every node in the cluster periodically broadcasts a message to the rest of the group saying, </span><i><span>“Here are the GTIDs I have successfully applied to my disk.”</span></i><span> The cluster then calculates a </span><i><span>global low watermark</span></i><span>. This watermark is the highest transaction GTID that </span><i><span>every single member</span></i><span> of the group has successfully applied. Garbage collection is only allowed to purge write-sets from the certification database that fall </span><i><span>below</span></i><span> this global watermark. </span><span><br>
</span><span>To note that this purge is a synchronous operation during which writes are forbidden.</span></p>
<h3><span>2. How the Apply Queue Stalls the Watermark</span></h3>
<p><span>When a secondary node starts lagging, its </span><i><span>applier queue</span></i><span> grows. This means the secondary is receiving transactions from the network quickly, but its SQL thread is too slow to actually execute them and commit them to disk.</span></p>
<p><span>Because the secondary hasn’t applied these transactions, it cannot report those GTIDs back to the group as “finished.”</span></p>
<ul>
<li aria-level="1"><span>The lagging secondary’s local watermark stalls.</span></li>
<li aria-level="1"><span>Therefore, the </span><i><span>global low watermark</span></i><span> for the entire cluster stalls.</span></li>
<li aria-level="1"><span>Because the global watermark hasn’t moved forward, the </span><span>garbage_collect</span><span> function on the primary (and all other nodes) says, </span><i><span>“I am not allowed to delete any write-sets yet.”</span></i></li>
<li aria-level="1"><span>As the primary continues to process new writes, the </span><span>certification_info</span><span> memory buffer grows continuously.</span></li>
</ul>
<h3><span>3. Why the Primary Cannot Purge Early</span></h3>
<p><span>we might wonder: </span><i><span>If the transaction is already committed on the primary, why does the primary care if the secondary has applied it? Why not just drop the write-set from its own memory?</span></i></p>
<p><span>The answer comes down to </span><i><span>Failover Safety</span></i><span> and </span><i><span>Distributed Conflict Detection</span></i><span>. GR is a shared-nothing, decentralized architecture. Even if you are running in Single-Primary  mode (keep this in mind will be important later), the underlying engine uses the exact same logic as Multi-Primary mode. </span></p>
<p><span>Here is why the primary is forbidden from purging that data:</span></p>
<ul>
<li aria-level="1"><span>The Failover Scenario: Imagine our primary node crashes right now. The lagging secondary (which still has a massive apply queue) is immediately elected as the new primary.</span></li>
<li aria-level="1"><span>The Conflict Risk: As the new primary, it starts accepting new writes from your application. However, it still has thousands of old transactions in its applier queue that it hasn’t written to disk yet!</span></li>
<li aria-level="1"><span>The Necessity of the Buffer: When a new write comes in, the new primary </span><i><span>must</span></i><span> check if that write conflicts with any of the pending transactions in its apply queue. It does this by checking the </span><span>certification_info</span><span> map. If the old primary had purged the global certification data early, the new primary wouldn’t have the write-sets for those pending transactions. It would blindly accept the new write, causing a massive data conflict and breaking the replication group entirely.</span></li>
</ul>
<p><span>Fine Marco, then what is the effect of this?</span></p>
<p> </p>
<p><span>Well, drums roll …</span></p>
<p><span>… When a secondary node is elected as the new primary during a failover, it does not immediately open the floodgates to new writes. </span><b>It keeps its </b><b><i>super_read_only</i></b><b> variable set to ON until it has completely drained its local apply queue of all transactions that were certified prior to the election.</b></p>
<p><span>This is an intentional design choice to guarantee that the new primary’s state is completely consistent with the old primary before it starts accepting new data.</span></p>
<p> </p>
<h3><span>4. Immediate Write Rejections (No Built-in Queuing)</span></h3>
<p><span>The most critical impact to understand is that the new primary does not queue or pause new incoming writes while it catches up. It outright rejects them.</span></p>
<p><span>If our application or proxy routes a COMMIT, INSERT, UPDATE, or DELETE to the new primary while it is still processing the old queue, MySQL will immediately throw an error back to the client:</span></p>
<p><span>ERROR 1290 (HY000): The MySQL server is running with the –super-read-only option so it cannot execute this statement</span></p>
<h3><span>5. The “Brownout” Window (Write Outage)</span></h3>
<p><span>Because of this behavior, a failover in MySQL Group Replication does not instantly restore write availability. Our cluster experiences a “brownout”, a period where reads might succeed, but writes are entirely blocked.</span></p>
<p><span>The duration of this write outage is directly proportional to the size of the apply queue.</span></p>
<ul>
<li aria-level="1"><span>If the secondary was fully caught up, write availability is restored in milliseconds.</span></li>
<li aria-level="1"><span>If the secondary was lagging by 50 minutes, your application will suffer a 50 minute write outage while the node applies the backlog.</span></li>
</ul>
<h3><span>6. Impact on Proxies (e.g., MySQL Router or ProxySQL)</span></h3>
<p><span>If we are using a proxy layer to route your database traffic, the apply queue dictates how the proxy behaves during the transition:</span></p>
<ul>
<li aria-level="1"><span>MySQL Router: It continuously monitors the cluster topology and the super_read_only flag. Even though the node has technically been elected primary, Router will not open the read-write port to it until the apply queue drains and super_read_only flips to OFF. Depending on your application timeouts, client connections will either hang waiting for a writable connection or fail completely.</span></li>
<li aria-level="1"><span>ProxySQL: Similar to Router, if it is configured to check for the read_only state, it will temporarily quarantine the new primary from the write hostgroup.</span></li>
<li aria-level="1"><span>HAProxy (in Operator): Monitor both Primary state and read_only state, but it expose the Primary to writes causing the application to fail (bug we need to fix)  </span></li>
</ul>
<h3><span>7. Read Traffic and Stale Data</span></h3>
<p><span>During this catch-up phase, the node will accept incoming </span><span>SELECT</span><span> queries (since it is still a valid database). However, because it is actively churning through the old primary’s backlog, the data being read is temporarily stale.</span></p>
<p><span>If your application reads a row that is sitting in the apply queue but hasn’t been committed to disk yet, it will get the old version of that row.</span></p>
<h2><span>Why Flow Control is Critical</span></h2>
<p><span>Because a large apply queue turns a seamless failover into a severe, application-breaking write outage, Group Replication includes the Flow Control feature.</span></p>
<p><span>Flow Control monitors the size of the apply queues across all secondaries. If a secondary starts lagging too far behind, Flow Control should actively throttle the write throughput on the </span><i><span>current</span></i><span> primary to allow the lagging node to catch up. It is essentially a trade-off: we accept a slight performance hit during normal operations to guarantee that your database recovers almost instantly during a failover.</span></p>
<p><b>However, this is not what really happens</b><span>.</span></p>
<h3><span>1. It is Reactive, Not Proactive (The Polling Blind Spot)</span></h3>
<p><span>Flow control does not intercept and evaluate every single transaction in real-time. Instead, it relies on a periodic polling interval governed by </span><span>group_replication_flow_control_period</span><span> (which defaults to 1 second).</span></p>
<p><span>Once a second, the cluster checks the size of the apply queues and the certifier queues.</span></p>
<ul>
<li aria-level="1"><span>The Vulnerability: If our application generates a massive spike of 50,000 writes in 500 milliseconds, the primary will happily accept and certify all of them. Flow control will not even notice the spike until the next 1 second polling interval hits. By the time it decides to apply a throttle, the damage is already done, and the secondary’s queue is already overflowing.</span></li>
</ul>
<h3><span>2. The PID Controller’s “Soft Brake” Math</span></h3>
<p><span>When flow control does decide to throttle, it does not simply freeze the primary. It uses a PID (Proportional-Integral-Derivative) controller algorithm to calculate a “write quota” (the maximum number of transactions the primary is allowed to commit in the next second).</span></p>
<p><span>The PID controller is deliberately tuned to be gentle. It wants to gracefully degrade performance rather than cause immediate application timeouts.</span></p>
<ul>
<li aria-level="1"><span>When the secondary’s queue breaches the </span><span>group_replication_flow_control_applier_threshold</span><span> (default 25,000 transactions), the PID controller reduces the primary’s quota incrementally.</span></li>
<li aria-level="1"><span>The Failure Point: If the primary’s incoming write rate is astronomically higher than the secondary’s disk IO capacity, this incremental “step down” in the quota is too slow. The primary is still allowed to write, say, 10,000 transactions per second, while the secondary is only applying 2,000. The queue continues to grow aggressively despite the throttle being “active.”</span></li>
</ul>
<h3><span>3. The Concurrency Mismatch (Parallel vs. Serial)</span></h3>
<p><span>This is often the silent killer that defeats flow control. Flow control makes mathematical assumptions about how fast the secondary </span><i><span>should</span></i><span> be able to apply transactions based on recent history.</span></p>
<p><span>However, the primary node might be executing writes using hundreds of highly concurrent threads. The secondary relies on the parallel applier to keep up. If the incoming workload suddenly includes transactions that cannot be parallelized, such as writes hitting overlapping rows, cascading foreign key updates, or DDL statements, the secondary’s applier instantly drops from executing in parallel down to a single, serialized thread.</span></p>
<p><span>When this serialization happens, the secondary’s applier rate plummets instantly. Flow control, which only checks in once a second and adjusts gradually, cannot brake the primary fast enough to compensate for the secondary suddenly dropping to a crawl.</span></p>
<h2><span>What can we do?</span></h2>
<p><span>At the moment of writing there are only two things that can be done.</span></p>
<ol>
<li aria-level="1"><span>Make Flow control more aggressive</span></li>
<li aria-level="1"><span>Increase the number of replication appliers</span></li>
</ol>
<p> </p>
<h3><span>1. Making Flow Control More Aggressive</span></h3>
<p><span>We can configure Flow Control to be a bit more aggressive. It will still remain a </span><i><span>suggestion</span></i><span> but a strong one.</span></p>
<p><span>How it works (The Configuration):</span></p>
<ul>
<li aria-level="1"><span>Lower the Threshold: By reducing </span><span>group_replication_flow_control_applier_threshold</span><span> (default is 25,000) to something like 1,000 or 500, we force the PID controller to kick in almost immediately when a spike occurs.</span></li>
<li aria-level="1"><span>Remove the Safety Net: By keeping  </span><span>group_replication_flow_control_min_quota</span><span> to </span><span>0 </span><span>(default), we remove the minimum write guarantee. If the secondary falls behind, Flow Control is allowed to throttle the primary’s writes down to zero, also if this will never happen.</span></li>
<li aria-level="1"><span>Increase the Sensitivity: We can tweak the PID controller’s math (using the derivative and proportional tuning variables) to react much more aggressively to queue growth.</span><span><br>
</span><span>       group_replication_flow_control_hold_percent=100</span><span><br>
</span><span>       group_replication_flow_control_release_percent=5</span></li>
</ul>
<p> </p>
<p><b>The reality check, does it work?:</b></p>
<p><span>If the expectation is to have a rigid control over the applier queue on the lagging secondary, then the answer is </span><b>NO</b><span>. No matter what, at the moment flow control is not designed to act as we are used to in PXC (Percona Xtradb Cluster), where we have a rigid control of the pending queue also at the cost of delaying the writes. In Group Replication  the Flow Control will never bring the write to 0, the unfortunate aspect is that the mechanism is not enough to keep the queue under control.</span></p>
<p> </p>
<h3><span>2. Increasing Replication Appliers </span></h3>
<p><span>To help the secondary chew through the queue faster, we can increase the number of parallel threads it uses to write to disk.</span></p>
<p><b>How it works</b><span>: We can increase the </span><span>replica_parallel_workers</span><span> (formerly </span><span>slave_parallel_workers</span><span>) setting. GR is exceptionally smart about this. Because of the certification process we discussed earlier, GR already knows exactly which transactions modify which rows. It uses a writeset-based dependency tracker to safely hand off non-conflicting transactions to multiple worker threads simultaneously.</span><span><br>
</span><span>The formula that is normally used to calculate the number of replication workers is to set 2.5 workers for each available core. IE if we have 14000m CPUs in our CR (K8) then we can assign ~35 workers, this is definitely higher than the default value of 4.   </span></p>
<p><b>The reality check, does it work?</b><span>:  </span><b>Yes</b><span>, but only if our workload allows it.</span></p>
<ul>
<li aria-level="1"><span>The Catch – The Serialization Wall: Parallel appliers only work if the transactions do not conflict. If our application has 50 concurrent threads all trying to update the same “inventory count” row, or updating a highly contentious table, those transactions </span><i><span>cannot</span></i><span> be parallelized. The secondary’s coordinator thread will see the row-level conflicts and force those transactions to wait in line and execute sequentially. We could allocate 128 parallel workers, but 127 of them will sit idle while one thread does all the work.</span></li>
<li aria-level="1"><span>The Catch – Context Switching: More threads do not magically create more disk IOPS. If we set the workers too high (e.g., beyond the physical CPU core count or disk IO capacity), the secondary’s InnoDB engine will spend more time context-switching and fighting over internal mutex locks than actually committing data. In many cases, over-allocating parallel workers actually </span><i><span>slows down</span></i><span> the apply rate.</span></li>
</ul>
<h2><span>Do we have any conclusions?</span></h2>
<h3><span>1. If HA is the goal, enforce Strict Flow Control</span></h3>
<p><span>If our absolute top priority is High Availability, specifically achieving a near-zero Recovery Time Objective (RTO), we must configure an aggressive flow control.</span></p>
<ul>
<li aria-level="1"><span>The Logic: Fast failovers require small apply queues. To guarantee a small apply queue, we must strictly throttle the primary the millisecond the secondary starts to lag.</span></li>
<li aria-level="1"><span>The Trade-off: we are protecting the cluster’s failover readiness at the expense of application write latency. If there is a massive write spike, our application will face timeouts and connection errors, but if the primary server suddenly catches fire, our database will recover and elect a new primary almost instantly.</span></li>
</ul>
<p><span>The problem is that Group Replication is not able to act like that today, this is something we eventually need to implement to have better HA.</span></p>
<h3><span>2. If Performance is the goal, relax Flow Control</span></h3>
<p><span>If our top priority is keeping the application fast and ensuring </span><span>COMMIT</span><span> latencies remain extremely low, we should relax flow control or rely on the generous defaults.</span></p>
<ul>
<li aria-level="1"><span>The Logic: By relaxing flow control, we allow the primary to run at the absolute maximum speed its local disks and CPU allow. It does not care if the secondaries fall behind. Our application users remain happy and experience zero throttling.</span></li>
<li aria-level="1"><span>The Trade-off: We are accepting severe risks to your HA posture. If the primary crashes while the secondaries have a massive apply queue, we will suffer a long write outage (the brownout) while the new primary catches up. Additionally, we are accepting the risk that the </span><span>certification_info</span><span> memory buffer will grow significantly on the primary and eventually have the pod OOMKilled .</span></li>
</ul>
<h3><span>3. Is this not what Asynchronous replication with semy-sync offers?</span></h3>
<p> </p>
<h4><i><span>1. The Similarities</span></i></h4>
<p><span>If we look purely at how a single transaction flows and how a failover behaves, GR and Semi-Sync look like twins:</span></p>
<ul>
<li aria-level="1"><span>The Durability Guarantee: </span><i><span>Semi-Sync:</span></i><span> The primary waits to commit until at least one secondary confirms it has received the transaction and written it to its local Relay Log. </span>
<ul>
<li aria-level="2"><i><span>GR:</span></i><span> The primary waits to commit until a majority quorum of nodes confirm they have received the transaction, certified it, and written it to their local relay logs.</span></li>
</ul>
</li>
<li aria-level="1"><span>The Failover Delay (The Queue):  In both systems, the secondary receiving the data does not mean the secondary has applied the data to its InnoDB tables.</span>
<ul>
<li aria-level="2"><span>If a crash happens, both systems require the new primary to completely execute its pending queue (Relay Log for Semi-Sync, Apply Queue for GR) before it is safe to accept new writes.</span></li>
</ul>
</li>
</ul>
<h4><i><span>2. The Crucial Differences</span></i></h4>
<p><span>If they behave so similarly, why use GR at all? </span><span><br>
</span><span>The differences lie entirely in automation, consensus, and split-brain protection. Semi-Sync is just a data transport mechanism; GR is a full state-machine cluster.</span></p>
<p><span>Here is what GR gives you that Semi-Sync does not:</span></p>
<ul>
<li aria-level="1"><span>Automatic Election and Orchestration:</span>
<ul>
<li aria-level="2"><i><span>Semi-Sync:</span></i><span> If the primary dies, Semi-Sync does nothing. The cluster sits there broken. You must rely on external tools (like Orchestrator or manual DBA intervention) to detect the crash, pick the most up-to-date secondary, wait for its relay log to apply, disable </span><span>read_only</span><span>, and re-point the application.</span></li>
<li aria-level="2"><i><span>GR:</span></i><span> The cluster detects the failure natively. The remaining nodes use Paxos consensus to elect a new primary automatically, manage the queue drain natively via the </span><span>super_read_only</span><span> flip we discussed, and self-heal.</span></li>
</ul>
</li>
<li aria-level="1"><span>Split-Brain Protection (Network Partitions):</span>
<ul>
<li aria-level="2"><i><span>Semi-Sync:</span></i><span> If our network splits in half, an external failover tool might accidentally promote a secondary while the old primary is still alive and accepting writes. We now have a split-brain, and our data is permanently corrupted.</span></li>
<li aria-level="2"><i><span>GR:</span></i><span> GR enforces strict quorum. If a network split happens, the side of the network with the minority of nodes will automatically fence itself off and refuse all writes. Split-brain is mathematically prevented.</span></li>
</ul>
</li>
<li aria-level="1"><span>The Certification Database:</span>
<ul>
<li aria-level="2"><span>As we established, GR requires the certification map to ensure the new primary doesn’t accept writes that conflict with its unapplied queue. Semi-Sync does not have this; it relies entirely on the external failover tool to guarantee no writes touch the new primary until the relay log is 100% applied.</span></li>
</ul>
</li>
</ul>
<h4><i><span>3. Final observation</span></i></h4>
<p><span>If we are using Single-Primary GR with relaxed flow control, we have essentially built a highly-automated, consensus-driven version of Semi-Sync replication. </span></p>
<p><span>We have the exact same apply-queue bottleneck during failover, but we have traded the need for external orchestrator tools for built-in Paxos consensus and native split-brain protection.</span></p>
<p> </p>
<h2><span>Conclusions (for real)</span></h2>
<p><span>When we run MySQL on a traditional, dedicated Virtual Machine, memory limits are “soft.” If the </span><span>certification_info</span><span> database explodes and consumes an extra 10GB of RAM because of the applier lag, the Linux OS might start aggressively swapping inactive pages to disk, but the MySQL process usually survives. Performance degrades, but the database stays online.</span></p>
<p><span>In Kubernetes, memory limits are “hard.” As we discussed earlier, Kubernetes enforces pod memory limits via cgroups v2 (</span><span>memory.max</span><span>). The Linux kernel’s OOM Killer has no understanding of database quorum, failover states, or apply queues. It only sees math: </span><i><span>Working Set Size &gt; </span></i><i><span>memory.max</span></i><i><span> = Terminate Process (Exit Code 137).</span></i></p>
<h3><span>The Chain Reaction of Relaxed Flow Control in k8s</span></h3>
<p><span>If we prioritize “performance” by relaxing Flow Control in a Kubernetes environment, we are essentially setting a ticking time bomb. Here is the chain of events:</span></p>
<ol>
<li aria-level="1"><b>The Spike</b><span>: Our application experiences a massive write spike.</span></li>
<li aria-level="1"><b>The Queue</b><span>: The secondary pod’s disk cannot keep up, and its applier queue grows to 1,000,000 transactions.</span></li>
<li aria-level="1"><b>The Memory Sprawl</b><span>: Because the queue is large, the global low-watermark stalls. The Primary pod is forbidden from garbage collecting the </span><span>certification_info</span><span> map. The in-memory hash map balloons in size.</span></li>
<li aria-level="1"><b>The Execution</b><span>: The </span><i><span>memory.current</span></i><span> metric will reach the </span><i><span>memory.max</span></i><span>, kernel will trigger the OMMKill process. First action will be to try to free the page.cache related to the process. If the purge is successful and the memory.current is less than </span><i><span>memory.max</span></i><span> then the process will persist, otherwise the kernel will kill it. </span><span><br>
</span><span>We can use the WSS metric to predict a successful OMMKill.</span><span><br>
</span><span> The Primary pod’s Working Set Size (WSS) breaches its Kubernetes memory limit, this is a fair estimate not an absolute value.</span></li>
<li aria-level="1"><b>The Catastrophe</b><span>: The Linux OOM Killer instantly assassinates the Primary MySQL process.</span></li>
</ol>
<p><span>Because we tried to avoid a few seconds of write latency by keeping relaxed Flow Control, we inadvertently caused a hard crash of the primary database pod, with long write downtime.</span></p>
<h3><span>The Architectural Law</span></h3>
<p><span>Therefore, here is my statement as architectural law for containerized environments: </span><b>In Kubernetes, High Availability and Pod stability are so intrinsically linked that Flow Control </b><b><i>must</i></b><b> act as hard as it can to cap the apply queue.</b></p>
<ul>
<li aria-level="1"><span>We cannot allow unbounded memory growth in a container. The only way to bound </span><span>certification_info</span><span> memory is to bound the apply queue.</span></li>
<li aria-level="1"><span>The only way to bound the apply queue is with strict, aggressive Flow Control.</span></li>
<li aria-level="1"><span>Increasing the number of replication appliers helps but is not the conclusive answer.</span></li>
</ul>
<p><span>In a Kubernetes environment, we must tune </span><span>group_replication_flow_control_applier_threshold</span><span> to a strict, low number, and accept that during massive traffic spikes, our application </span><i><span>will</span></i><span> experience write throttling. It is infinitely better for our application’s connection pool to wait 2 seconds for a </span><span>COMMIT</span><span> to succeed than for the primary database pod to be violently OOMKilled by the kernel, and have to wait for minutes or hours to recover write capabilities.</span></p>
<h3><span>Note</span></h3>
<p><span>Just as a mention this is exactly how Percona Operator with Percona Xtradb Cluster works. To be more specific, PXC and in general solutions based on Galera have a Flow Control mechanism that enforces the queue to be inside hard limits. While this more invasive control may be noticeable at application level, it guarantees that the other nodes are not lagging behind the primary and this is why it is a stronger HA solution in the Kubernetes environment.</span></p>
<p> </p>
<h2><span>Reference</span></h2>
<p><a href="https://github.com/Tusamarco/mysqloperatorcalculator"><span>https://github.com/Tusamarco/mysqloperatorcalculator</span></a></p>
<p><span>Managing Resources and OOMKills: </span><a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/"><span>Resource Management for Pods and Containers</span></a> <i><span>(This page details how memory limits are enforced reactively by the Linux kernel via OOM kills).</span></i></p>
<p><span>How WSS triggers Evictions: </span><a href="https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/"><span>Node-pressure Eviction</span></a> <i><span>(This page explicitly details how the </span></i><i><span>kubelet</span></i><i><span> uses the </span></i><i><span>memory.available</span></i><i><span> signal, which is derived from node capacity minus the working set size).</span></i></p>
<p><span>Latest changes. </span><a href="https://github.com/google/cadvisor/blob/195858077459e69455fd9621fcbaeaf377d69d0e/container/libcontainer/handler.go#L865"><span>Pointer to the code</span></a><span> </span></p>
<p><span>Swap Memory Management (Core Concepts &amp; Configuration): </span><a href="https://kubernetes.io/docs/concepts/cluster-administration/swap-memory-management/"><span>https://kubernetes.io/docs/concepts/cluster-administration/swap-memory-management/</span></a></p>
<p>The post <a href="https://www.percona.com/blog/the-failover-brownout-rethinking-high-availability-in-mysql-group-replication/">The Failover Brownout: Rethinking High Availability in MySQL Group Replication</a> appeared first on <a href="https://www.percona.com/">Percona</a>.</p>]]></content:encoded>
    <pubDate>Mon, 15 Jun 2026 06:57:24 +0000</pubDate>
    <dc:creator>Marco Tusa</dc:creator>
    <category>Benchmarks</category>
    <category>Kubernetes</category>
    <category>MySQL</category>
    <category>Operator</category>
    <category>Uncategorized</category>
    <category>flow control</category>
    <category>group replication</category>
  </item>

  <item>
    <title>Stop Guessing Your Kubernetes MySQL Configs: Meet the MySQL Operator Calculator</title>
    <guid isPermaLink="false">http://www.tusacentral.com/joomla/index.php/mysql-blogs/267-stop-guessing-your-kubernetes-mysql-configs-meet-the-mysql-operator-calculator</guid>
    <link>http://www.tusacentral.com/joomla/index.php/mysql-blogs/267-stop-guessing-your-kubernetes-mysql-configs-meet-the-mysql-operator-calculator</link>
    <description>Let’s be honest: migrating a relational database to Kubernetes sounds fantastic in a whiteboard meeting, but the reality of day-two operations is a completely different story.
When moving MySQL to Kubernetes, the ultimate goal is simple: identify a safe, performant set of configuration values for your database pods. But where do you start? Usually, you look at your overall node resources say, a machine with 16 CPUs and 64GB of RAM.
In the old bare-metal days, you'd apply the standard rules of thumb:


Set innodb_buffer_pool_size to 60-80% of total RAM to maximize caching.


Allocate 1 innodb_buffer_pool_instances per 1GB of buffer pool.


Match innodb_io_capacity to your drive speeds.


If you try applying these legacy rules in Kubernetes, your pod won't survive.
The Kubernetes Reality Check: OOMKills and Probe Traps
Why do the old rules fail? Because Kubernetes environments lack swap space. If a pod exceeds its assigned memory limit, Kubernetes executes an immediate, destructive action: an OOMKill.
Standard tuning rules don't account for the hidden memory consumers inside a K8s pod. You aren't just allocating memory for MySQL anymore; you have to share the pod's limits across running connections, the routing proxy, monitoring sidecars, and internal database processes.
For example, extensive testing reveals that Percona Server (PS) with Group Replication consumes about 9% to 11% more memory than Percona XtraDB Cluster (PXC) under the exact same load. If you blindly allocate 80% of your RAM to the buffer pool, that extra 10% overhead from Group Replication will push you right over the edge.
Memory isn't the only trap. During OLTP load testing (using sysbench TPC-C), pods can get killed before memory even peaks. The culprit? Kubernetes liveness and readiness probes. Under heavy load, a perfectly healthy database pod might take slightly longer to respond. If your probe timeouts are too short, K8s assumes the pod is dead and kills it with no questions asked.
Step 1: Discover Your Actual Resources
To avoid these pitfalls, you must know what resources you actually have before tuning anything. A 64GB node does not give you 64GB of pod memory. Cloud providers run system pods to manage the cluster, which silently consume your baseline resources.
Before applying any configurations, check your node:



Bash


kubectl describe node &amp;lt;nodename&amp;gt;




You might see something like this in the output:



Plaintext


  Resource           Requests    Limits
  --------           --------    ------
  cpu                702m (4%)   1200m (7%)
  memory             645Mi (1%)  1994Mi (3%)




In this scenario, 7% of the CPU and 3% of the memory are already spoken for. Your 16 CPUs and 64GB of RAM are actually closer to 14 CPUs and 58GB of usable memory. If you base your manual database tuning on the 64GB fantasy, you are already on a collision course with an OOMKill.
You could try to manually scale down your buffers to be &quot;safe&quot; (e.g., arbitrarily dropping the buffer pool to 50%), but then you sacrifice massive amounts of performance.
This is where the guessing game has to stop.
Enter the MySQL Operator Calculator
Built as a lightning-fast, RESTful Go service, the MySQL Operator Calculator is designed to take this exact math entirely out of your hands.
Instead of manually calculating overheads for proxies, monitors, and Group Replication, you simply feed the calculator your actual available pod resources and workload type. It dynamically computes the optimal, mathematically safe configuration parameters for your Kubernetes operator (such as the Percona Operator for MySQL).
Why You Need It in Your Toolkit:


Say Goodbye to OOM Kills: The tool mathematically balances your total allocated memory across the three critical components of a modern K8s database pod: the mysql engine, the proxy layer, and the monitor agent.


Workload-Aware Tuning: Simply tell the calculator your load type (Read-Heavy, Light OLTP, or Heavy OLTP), and it adjusts the buffers and threads accordingly.


Automation: Designed with modern infrastructure in mind, the calculator outputs clean, structured json. You can easily curl the API from your CI/CD pipelines to automatically inject calculated configurations into your Helm charts.


Auto-Calculated Connections: Not sure how many connections your memory limit can safely handle? Pass 0 for connections, and the tool will calculate the maximum safe threshold for you.


How It Works in Practice
Getting your optimized configuration is as simple as making an HTTP request. Let's say you have a heavy OLTP Percona XtraDB Cluster (PXC), you've identified you have exactly 4 CPUs and 2.5GB of RAM available, and you want the tool to figure out the max connections for MySQL 8.0.33. Just ask:



Bash


curl -i -X GET -H &quot;Content-Type: application/json&quot; -d '{
  &quot;output&quot;: &quot;human&quot;,
  &quot;dbtype&quot;: &quot;pxc&quot;,
  &quot;dimension&quot;: {
    &quot;id&quot;: 999,
    &quot;cpu&quot;: 4000,
    &quot;memory&quot;: &quot;2.5G&quot;
  },
  &quot;loadtype&quot;: {&quot;id&quot;: 3},
  &quot;connections&quot;: 0,
  &quot;mysqlversion&quot;: {&quot;major&quot;: 8, &quot;minor&quot;: 0, &quot;patch&quot;: 33}
}' http://127.0.0.1:8080/calculator




Using the human output flag gives you a highly readable, my.cnf-style output, while the json flag provides structured data detailing the exact configuration section, calculated value, and the safe minimums/maximums used in the background math.
Ready to Stop Guessing?
Container orchestration is complex enough without having to manually calculate memory overheads on a calculator app at 2:00 AM during an outage. By programmatically determining your limits, you ensure your database remains stable, performant, and perfectly sized for its environment.
This is why I developed this tool, initially for my personal use, but I think it can be useful to others, so here we go:
Check out the source code, compile the binary, and start optimizing your clusters today by visiting the MySQL Operator Calculator on GitHub.
Or you can try it using the service at tusacentral.net:8080 like:



curl -i -X GET -H &quot;Content-Type: application/json&quot; -d '{
&quot;output&quot;:&quot;human&quot;,
&quot;dbtype&quot;:&quot;group_replication&quot;, 
&quot;dimension&quot;: {&quot;id&quot;: 999, &quot;cpu&quot;: 16000, &quot;memory&quot;: &quot;64G&quot;}, 
&quot;loadtype&quot;: {&quot;id&quot;: 3}, 
&quot;connections&quot;:1500,
&quot;mysqlversion&quot;:{&quot;major&quot;:8,&quot;minor&quot;:4,&quot;patch&quot;:8},
&quot;providercostpct&quot;:0.10}' http://tusacentral.net:8080/calculator





This is just for demo and cannot be used as reference for a service, please build your own server for that.
 
Of course the use of the settings generated is at your own risk, I am not taking any responsability in case they are not working, so test them over and over and see if they match your needs.
Also read the recent blogs https://tusacentral.net/joomla/index.php/mysql-blogs/265-group-replication-vs-percona-xtradb-cluster-the-true-cost-of-consistency and https://tusacentral.net/joomla/index.php/mysql-blogs/266-the-failover-brownout-rethinking-high-availability-in-mysql-group-replicationthey are VERY important to understand what is going on in the operator especially the one using Grup Replication.
PR or issue requests are welcome.
</description>
    <content:encoded><![CDATA[<p data-path-to-node="3">Let’s be honest: migrating a relational database to Kubernetes sounds fantastic in a whiteboard meeting, but the reality of day-two operations is a completely different story.</p>
<p data-path-to-node="4">When moving MySQL to Kubernetes, the ultimate goal is simple: identify a safe, performant set of configuration values for your database pods. But where do you start? Usually, you look at your overall node resources say, a machine with 16 CPUs and 64GB of RAM.<img src="http://www.tusacentral.com/joomla/images/stories/varie/front_image_calculator.png" alt="front image calculator" width="415" height="226"></p>
<p data-path-to-node="5">In the old bare-metal days, you'd apply the standard rules of thumb:</p>
<ul data-path-to-node="6">
<li>
<p data-path-to-node="6,0,0">Set <code data-path-to-node="6,0,0" data-index-in-node="4">innodb_buffer_pool_size</code> to 60-80% of total RAM to maximize caching.</p>
</li>
<li>
<p data-path-to-node="6,1,0">Allocate 1 <code data-path-to-node="6,1,0" data-index-in-node="11">innodb_buffer_pool_instances</code> per 1GB of buffer pool.</p>
</li>
<li>
<p data-path-to-node="6,2,0">Match <code data-path-to-node="6,2,0" data-index-in-node="6">innodb_io_capacity</code> to your drive speeds.</p>
</li>
</ul>
<p data-path-to-node="7">If you try applying these legacy rules in Kubernetes, your pod won't survive.</p>
<h3 data-path-to-node="8">The Kubernetes Reality Check: OOMKills and Probe Traps</h3>
<p data-path-to-node="9">Why do the old rules fail? Because Kubernetes environments lack swap space. If a pod exceeds its assigned memory limit, Kubernetes executes an immediate, destructive action: an <b data-path-to-node="9" data-index-in-node="177">OOMKill</b>.</p>
<p data-path-to-node="10">Standard tuning rules don't account for the hidden memory consumers inside a K8s pod. You aren't just allocating memory for MySQL anymore; you have to share the pod's limits across running connections, the routing proxy, monitoring sidecars, and internal database processes.</p>
<p data-path-to-node="11">For example, extensive testing reveals that Percona Server (PS) with Group Replication consumes about 9% to 11% more memory than Percona XtraDB Cluster (PXC) under the exact same load. If you blindly allocate 80% of your RAM to the buffer pool, that extra 10% overhead from Group Replication will push you right over the edge.</p>
<p data-path-to-node="12">Memory isn't the only trap. During OLTP load testing (using sysbench TPC-C), pods can get killed before memory even peaks. The culprit? <b data-path-to-node="12" data-index-in-node="136">Kubernetes liveness and readiness probes.</b> Under heavy load, a perfectly healthy database pod might take slightly longer to respond. If your probe timeouts are too short, K8s assumes the pod is dead and kills it with no questions asked.</p>
<h3 data-path-to-node="13">Step 1: Discover Your <i data-path-to-node="13" data-index-in-node="22">Actual</i> Resources</h3>
<p data-path-to-node="14">To avoid these pitfalls, you must know what resources you actually have before tuning anything. A 64GB node does not give you 64GB of pod memory. Cloud providers run system pods to manage the cluster, which silently consume your baseline resources.</p>
<p data-path-to-node="15">Before applying any configurations, check your node:</p>
<div class="code-block ng-tns-c3678280791-194 ng-animate-disabled ng-trigger ng-trigger-codeBlockRevealAnimation">
<div class="formatted-code-block-internal-container ng-tns-c3678280791-194">
<div class="animated-opacity ng-tns-c3678280791-194">
<div class="code-block-decoration header-formatted gds-emphasized-body-m ng-tns-c3678280791-194 ng-star-inserted"><span class="ng-tns-c3678280791-194">Bash</span>
<div class="buttons ng-tns-c3678280791-194 ng-star-inserted"></div>
</div>
<pre class="ng-tns-c3678280791-194" _ngcontent-ng-c3678280791><code class="code-container formatted ng-tns-c3678280791-194" _ngcontent-ng-c3678280791 role="text" data-test-id="code-content">kubectl describe node &lt;nodename&gt;
</code></pre>
</div>
</div>
</div>
<p data-path-to-node="17">You might see something like this in the output:</p>
<div class="code-block ng-tns-c3678280791-195 ng-animate-disabled ng-trigger ng-trigger-codeBlockRevealAnimation">
<div class="formatted-code-block-internal-container ng-tns-c3678280791-195">
<div class="animated-opacity ng-tns-c3678280791-195">
<div class="code-block-decoration header-formatted gds-emphasized-body-m ng-tns-c3678280791-195 ng-star-inserted"><span class="ng-tns-c3678280791-195">Plaintext</span>
<div class="buttons ng-tns-c3678280791-195 ng-star-inserted"></div>
</div>
<pre class="ng-tns-c3678280791-195" _ngcontent-ng-c3678280791><code class="code-container formatted ng-tns-c3678280791-195" _ngcontent-ng-c3678280791 role="text" data-test-id="code-content">  Resource           Requests    Limits
  --------           --------    ------
  cpu                702m (4%)   1200m (7%)
  memory             645Mi (1%)  1994Mi (3%)
</code></pre>
</div>
</div>
</div>
<p data-path-to-node="19">In this scenario, 7% of the CPU and 3% of the memory are already spoken for. Your 16 CPUs and 64GB of RAM are actually closer to <b data-path-to-node="19" data-index-in-node="129">14 CPUs and 58GB of usable memory</b>. If you base your manual database tuning on the 64GB fantasy, you are already on a collision course with an OOMKill.</p>
<p data-path-to-node="20">You <i data-path-to-node="20" data-index-in-node="4">could</i> try to manually scale down your buffers to be "safe" (e.g., arbitrarily dropping the buffer pool to 50%), but then you sacrifice massive amounts of performance.</p>
<p data-path-to-node="21">This is where the guessing game has to stop.</p>
<h3 data-path-to-node="22">Enter the MySQL Operator Calculator</h3>
<p data-path-to-node="23">Built as a lightning-fast, RESTful Go service, the <b data-path-to-node="23" data-index-in-node="51">MySQL Operator Calculator</b> is designed to take this exact math entirely out of your hands.</p>
<p data-path-to-node="24">Instead of manually calculating overheads for proxies, monitors, and Group Replication, you simply feed the calculator your <i data-path-to-node="24" data-index-in-node="124">actual</i> available pod resources and workload type. It dynamically computes the optimal, mathematically safe configuration parameters for your Kubernetes operator (such as the Percona Operator for MySQL).</p>
<p data-path-to-node="25"><b data-path-to-node="25" data-index-in-node="0">Why You Need It in Your Toolkit:</b></p>
<ul data-path-to-node="26">
<li>
<p data-path-to-node="26,0,0"><b data-path-to-node="26,0,0" data-index-in-node="0">Say Goodbye to OOM Kills:</b> The tool mathematically balances your total allocated memory across the three critical components of a modern K8s database pod: the <b data-path-to-node="26,0,0" data-index-in-node="158">mysql</b> engine, the <b data-path-to-node="26,0,0" data-index-in-node="176">proxy</b> layer, and the <b data-path-to-node="26,0,0" data-index-in-node="197">monitor</b> agent.</p>
</li>
<li>
<p data-path-to-node="26,1,0"><b data-path-to-node="26,1,0" data-index-in-node="0">Workload-Aware Tuning:</b> Simply tell the calculator your load type (Read-Heavy, Light OLTP, or Heavy OLTP), and it adjusts the buffers and threads accordingly.</p>
</li>
<li>
<p data-path-to-node="26,2,0"><b data-path-to-node="26,2,0" data-index-in-node="0">Automation:</b> Designed with modern infrastructure in mind, the calculator outputs clean, structured <code data-path-to-node="26,2,0" data-index-in-node="113">json</code>. You can easily curl the API from your CI/CD pipelines to automatically inject calculated configurations into your Helm charts.</p>
</li>
<li>
<p data-path-to-node="26,3,0"><b data-path-to-node="26,3,0" data-index-in-node="0">Auto-Calculated Connections:</b> Not sure how many connections your memory limit can safely handle? Pass <code data-path-to-node="26,3,0" data-index-in-node="101">0</code> for connections, and the tool will calculate the maximum safe threshold for you.</p>
</li>
</ul>
<h3 data-path-to-node="27">How It Works in Practice</h3>
<p data-path-to-node="28">Getting your optimized configuration is as simple as making an HTTP request. Let's say you have a heavy OLTP Percona XtraDB Cluster (PXC), you've identified you have exactly 4 CPUs and 2.5GB of RAM available, and you want the tool to figure out the max connections for MySQL 8.0.33. Just ask:</p>
<div class="code-block ng-tns-c3678280791-196 ng-animate-disabled ng-trigger ng-trigger-codeBlockRevealAnimation">
<div class="formatted-code-block-internal-container ng-tns-c3678280791-196">
<div class="animated-opacity ng-tns-c3678280791-196">
<div class="code-block-decoration header-formatted gds-emphasized-body-m ng-tns-c3678280791-196 ng-star-inserted"><span class="ng-tns-c3678280791-196">Bash</span>
<div class="buttons ng-tns-c3678280791-196 ng-star-inserted"></div>
</div>
<pre class="ng-tns-c3678280791-196" _ngcontent-ng-c3678280791><code class="code-container formatted ng-tns-c3678280791-196" _ngcontent-ng-c3678280791 role="text" data-test-id="code-content">curl -i -X GET -H <span class="hljs-string">"Content-Type: application/json"</span> -d <span class="hljs-string">'{
  "output": "human",
  "dbtype": "pxc",
  "dimension": {
    "id": 999,
    "cpu": 4000,
    "memory": "2.5G"
  },
  "loadtype": {"id": 3},
  "connections": 0,
  "mysqlversion": {"major": 8, "minor": 0, "patch": 33}
}'</span> http://127.0.0.1:8080/calculator
</code></pre>
</div>
</div>
</div>
<p data-path-to-node="30">Using the <code data-path-to-node="30" data-index-in-node="10">human</code> output flag gives you a highly readable, <code data-path-to-node="30" data-index-in-node="57">my.cnf</code>-style output, while the <code data-path-to-node="30" data-index-in-node="88">json</code> flag provides structured data detailing the exact configuration section, calculated value, and the safe minimums/maximums used in the background math.</p>
<h3 data-path-to-node="31">Ready to Stop Guessing?</h3>
<p data-path-to-node="32">Container orchestration is complex enough without having to manually calculate memory overheads on a calculator app at 2:00 AM during an outage. By programmatically determining your limits, you ensure your database remains stable, performant, and perfectly sized for its environment.</p>
<p>This is why I developed this tool, initially for my personal use, but I think it can be useful to others, so here we go:</p>
<p data-path-to-node="33">Check out the source code, compile the binary, and start optimizing your clusters today by visiting the <a href="https://github.com/Tusamarco/mysqloperatorcalculator" target="_blank" rel="noopener" class="ng-star-inserted">MySQL Operator Calculator on GitHub</a>.</p>
<p>Or you <strong>can try it</strong> using the service at tusacentral.net:8080 like:</p>
<div class="code-block ng-tns-c3678280791-196 ng-animate-disabled ng-trigger ng-trigger-codeBlockRevealAnimation">
<div class="formatted-code-block-internal-container ng-tns-c3678280791-196">
<div class="animated-opacity ng-tns-c3678280791-196">
<pre class="ng-tns-c3678280791-196" _ngcontent-ng-c3678280791><code class="code-container formatted ng-tns-c3678280791-196" _ngcontent-ng-c3678280791 role="text" data-test-id="code-content">curl -i -X GET -H "Content-Type: application/json" -d '{</code></pre>
<pre class="ng-tns-c3678280791-196" _ngcontent-ng-c3678280791><code class="code-container formatted ng-tns-c3678280791-196" _ngcontent-ng-c3678280791 role="text" data-test-id="code-content">"output":"human",</code></pre>
<pre class="ng-tns-c3678280791-196" _ngcontent-ng-c3678280791><code class="code-container formatted ng-tns-c3678280791-196" _ngcontent-ng-c3678280791 role="text" data-test-id="code-content">"dbtype":"group_replication", </code></pre>
<pre class="ng-tns-c3678280791-196" _ngcontent-ng-c3678280791><code class="code-container formatted ng-tns-c3678280791-196" _ngcontent-ng-c3678280791 role="text" data-test-id="code-content">"dimension": {"id": 999, "cpu": 16000, "memory": "64G"}, </code></pre>
<pre class="ng-tns-c3678280791-196" _ngcontent-ng-c3678280791><code class="code-container formatted ng-tns-c3678280791-196" _ngcontent-ng-c3678280791 role="text" data-test-id="code-content">"loadtype": {"id": 3}, </code></pre>
<pre class="ng-tns-c3678280791-196" _ngcontent-ng-c3678280791><code class="code-container formatted ng-tns-c3678280791-196" _ngcontent-ng-c3678280791 role="text" data-test-id="code-content">"connections":1500,</code></pre>
<pre class="ng-tns-c3678280791-196" _ngcontent-ng-c3678280791><code class="code-container formatted ng-tns-c3678280791-196" _ngcontent-ng-c3678280791 role="text" data-test-id="code-content">"mysqlversion":{"major":8,"minor":4,"patch":8},</code></pre>
<pre class="ng-tns-c3678280791-196" _ngcontent-ng-c3678280791><code class="code-container formatted ng-tns-c3678280791-196" _ngcontent-ng-c3678280791 role="text" data-test-id="code-content">"providercostpct":0.10}' http://tusacentral.net:8080/calculator
</code></pre>
</div>
</div>
</div>
<p data-path-to-node="30"></p>
<p>This is just for demo and cannot be used as reference for a service, please build your own server for that.</p>
<p> </p>
<p>Of course <strong>the use of the settings generated is at your own risk, I am not taking any responsability in case they are not working</strong>, so test them over and over and see if they match your needs.<br><br></p>
<p>Also read the recent blogs <a href="http://www.tusacentral.com/joomla/index.php/mysql-blogs/265-group-replication-vs-percona-xtradb-cluster-the-true-cost-of-consistency">https://tusacentral.net/joomla/index.php/mysql-blogs/265-group-replication-vs-percona-xtradb-cluster-the-true-cost-of-consistency</a> and <a href="http://www.tusacentral.com/joomla/index.php/mysql-blogs/266-the-failover-brownout-rethinking-high-availability-in-mysql-group-replication">https://tusacentral.net/joomla/index.php/mysql-blogs/266-the-failover-brownout-rethinking-high-availability-in-mysql-group-replication</a><br>they are VERY important to understand what is going on in the operator especially the one using Grup Replication.</p>
<p>PR or issue requests are welcome.</p>
<p data-path-to-node="35"></p>]]></content:encoded>
    <pubDate>Sat, 13 Jun 2026 10:46:13 +0000</pubDate>
    <dc:creator>Marco Tusa</dc:creator>
    <category>MySQL</category>
  </item>

  <item>
    <title>The Failover Brownout: Rethinking High Availability in MySQL Group Replication</title>
    <guid isPermaLink="false">http://www.tusacentral.com/joomla/index.php/mysql-blogs/266-the-failover-brownout-rethinking-high-availability-in-mysql-group-replication</guid>
    <link>http://www.tusacentral.com/joomla/index.php/mysql-blogs/266-the-failover-brownout-rethinking-high-availability-in-mysql-group-replication</link>
    <description>It is time to talk again about Flow control and group replication. This time with a special eye on the use of Group Replication in the Kubernetes context. In this article we will dig a bit on how it works and what are the various side effects. 
 
The problem
Recently I was refining the calculation I use in the MySQL calculator for Operator given I was constantly encountering a very serious problem with the Percona Server Operator.
The problem is that when the deployment was/is serving a high level of traffic, it will, no matter what, end up in getting OMMKill by the K8 system. 
This because the pod was gradually consuming more and more memory, reaching the memory limit set in the CR specification. 
 
Now let me clarify a few things, to get straight to the facts.
Kubernetes itself does not OOMKill a pod for hitting its memory limit, the mechanism works as described below with mention on how Working Set Size (WSS) is calculated, and how OOMKills are triggered, and in the resource sections, the links to the official documentation and source code.
 
1. The Reality of OOMKills vs. Kubelet Evictions
It is crucial to distinguish between what the Linux kernel does and what Kubernetes does:

OOMKilled (Exit Code 137): This is executed entirely by the Linux kernel's OOM Killer, not Kubernetes. When we set a memory limit in our Pod spec, Kubernetes translates that into a Linux cgroup constraint (memory.limit_in_bytes for cgroups v1, or memory.max for cgroups v2). If our container attempts to allocate more memory than this hard limit, and the kernel cannot reclaim any page cache (like inactive files), the kernel directly intervenes and terminates the process.
Node-Pressure Evictions: This is where Kubernetes actively observes memory. The kubelet monitors the working_set_bytes metric to protect the node from running out of memory. If the node's memory drops below an eviction threshold, Kubernetes will actively evict pods to prevent the kernel from initiating a system-wide OOM kill.

2. How Working Set Size (WSS) is Calculated for the container
Kubernetes monitors container memory via cAdvisor, which is integrated directly into the kubelet. cAdvisor calculates the Working Set Size by taking the total memory usage and subtracting the inactive file cache (memory that the kernel can easily reclaim if it faces memory pressure).
Because active file caches and anonymous memory (like our application's heap) cannot be easily evicted, this working set metric is the most accurate representation of the memory your container is forcing the system to hold.
 
The Calculation &amp;amp; cgroups Evolution The core mathematical calculation is Memory Usage - Inactive File Cache, but how cAdvisor fetches this data from the Linux kernel depends entirely on your node's cgroup version. Modern cAdvisor relies heavily on the opencontainers/runc/libcontainer library to read these raw cgroup files:

cgroups v1: cAdvisor starts with the raw usage from memory.usage_in_bytes and subtracts the reclaimable cache found under the total_inactive_file key.
cgroups v2 (Unified): cAdvisor starts with the raw usage from memory.current and subtracts the reclaimable cache found under the inactive_file key.

 
The Underlying Code Logic While older versions used a static setMemoryStats function, modern Kubernetes branches handle this dynamically. The logic executes the following flow before reporting back to the kubelet:

Detects Version: It identifies whether the node runs cgroups v1 or v2 to determine the correct inactive file key name.
Fetch Usage: It pulls the raw memory usage from the container.
Subtract Cache: It looks up the inactive file value and safely subtracts it from the usage (including a safeguard to ensure the working set never drops below zero).
Report Metric: It sets this final calculated value as container_memory_working_set_bytes, which the kubelet then uses to decide if the node is under memory pressure.

Back to us 
At the end the point is that if our pod reaches the limit and we ARE NOT using the new swap feature existing in Kubernetes, our pod will be brutally killed, and in 99% of the cases our production will suffer a lot. !Ops spoiler!
 
To clearly understand what was causing the issue about this memory consumption and having my calculator fail, I started to collect the information about the memory usage in MySQL itself.
 
SELECT EVENT_NAME,CURRENT_NUMBER_OF_BYTES_USED / 1024 / 1024 AS current_usage_mb FROM performance_schema.memory_summary_global_by_event_name WHERE EVENT_NAME like 'memory/%' and EVENT_NAME not like 'memory/performance%'  order by current_usage_mb desc limit 25;
Which will give you and output like this:
+---------------------------------------+------------------+
| EVENT_NAME                            | current_usage_mb |
+---------------------------------------+------------------+
| memory/innodb/buf_buf_pool            |   46398.92578125 |
| memory/group_rpl/GCS_XCom::xcom_cache |    1066.66179943 |
| memory/group_rpl/certification_info   |      92.45250702 |
| memory/innodb/log_buffer_memory       |      64.00096130 |
| memory/sql/TABLE                      |      49.90627003 |
| memory/innodb/memory                  |      34.68734741 |
| memory/innodb/ut0link_buf             |      24.00006104 |
| memory/innodb/lock0lock               |      21.40064240 |
| memory/mysqld_openssl/openssl_malloc  |       9.51009655 |
| memory/innodb/read0read               |       8.19496155 |
| memory/mysys/KEY_CACHE                |       8.00215149 |
| memory/innodb/sync0arr                |       7.03147125 |
| memory/innodb/ha_innodb               |       6.87006950 |
| memory/innodb/lock_sys                |       5.25009155 |
| memory/sql/log_sink_pfs               |       5.00003052 |
| memory/innodb/ut0pool                 |       4.00017548 |
| memory/sql/dd::objects                |       2.83031464 |
| memory/innodb/std                     |       2.72618866 |
| memory/innodb/os0file                 |       2.63054657 |
| memory/innodb/os0event                |       2.34302521 |
| memory/sql/TABLE_SHARE::mem_root      |       2.31734467 |
| memory/innodb/trx0trx                 |       2.22647858 |
| memory/temptable/physical_ram         |       1.00003052 |
| memory/sql/dd::String_type            |       0.94942093 |
| memory/innodb/btr0pcur                |       0.89743423 |
+---------------------------------------+------------------+

 
Plus I used PMM to collect memory information 

To simulate the load I used the sysbench-tpcc (tpc-c derivate test) variant and run the tests simulating a load of 1024 threads against a cluster based on machine with 16 Core and 64Gb volumes ~3k IOPS, so not gigantic but not small. 
 
The finding was almost immediate:
+---------------------------------------+------------------+
| EVENT_NAME                            | current_usage_mb |
+---------------------------------------+------------------+
| memory/innodb/buf_buf_pool            |   46398.92578125 |
| memory/group_rpl/certification_info   |    1431.67934418 | &amp;lt;constantly increasing
| memory/group_rpl/GCS_XCom::xcom_cache |    1066.63542366 |
| memory/sql/Gtid_set::Interval_chunk   |      95.52413940 |
| memory/innodb/log_buffer_memory       |      64.00096130 |
| memory/sql/TABLE                      |      48.17613125 |
| memory/innodb/memory                  |      35.08897400 |
| memory/innodb/ut0link_buf             |      24.00006104 |
| memory/innodb/lock0lock               |      21.40064240 |
| memory/innodb/read0read               |      14.86782837 |
| memory/mysqld_openssl/openssl_malloc  |      12.05916119 |
| memory/mysys/KEY_CACHE                |       8.00215149 |
| memory/innodb/sync0arr                |       7.03147125 |
| memory/innodb/ha_innodb               |       6.84074974 |
| memory/innodb/lock_sys                |       5.25009155 |
| memory/sql/log_sink_pfs               |       5.00003052 |
| memory/innodb/ut0pool                 |       4.00017548 |
| memory/sql/dd::objects                |       2.82012177 |
| memory/innodb/std                     |       2.72515869 |
| memory/innodb/os0file                 |       2.63054657 |
| memory/innodb/os0event                |       2.35884857 |
| memory/innodb/trx0trx                 |       2.22647858 |
| memory/sql/TABLE_SHARE::mem_root      |       1.83777618 |
| memory/innodb/trx0undo                |       1.26304626 |
| memory/mysys/lf_node                  |       1.08828735 |
+---------------------------------------+------------------+

 
 Ok then … What is the certification info???
What is group_rpl/certification_info?
In MySQL, memory/group_rpl/certification_info is a Performance Schema memory instrument. It tracks the exact amount of RAM allocated to store the Certification Database (or Certification Info).
In Group Replication, nodes do not lock rows across the network while a transaction is executing. Instead, transactions execute locally and optimistically. When it is time to commit, the transaction undergoes a Certification Process to ensure no other concurrent transaction in the cluster has modified the exact same rows. The certification_info buffer is the in-memory hash map that makes this conflict detection possible.
1. What is it used for?
The certification_info structure acts as a tracking ledger for recently modified rows.
Here is how it works under the hood:

The Key-Value Pair: It is fundamentally an in-memory dictionary. The key is the hash of a modified row (extracted from the transaction's &quot;write set&quot;), and the value is the Global Transaction Identifier (GTID) of the transaction that successfully modified it.
Conflict Detection: When a new transaction attempts to commit, it broadcasts its write set and the &quot;snapshot version&quot; of the database it saw when it started. The certifier cross-references the incoming transaction's write set against the certification_info map.
The Decision: If the certification_info shows that a row was modified by a newer GTID that the incoming transaction did not &quot;see&quot; when it started, a conflict is flagged, and the transaction is aborted. If no conflict exists, the transaction is certified, and the certification_info map is updated with the new write set and GTID.

The primary does not hold onto this memory out of stubbornness; it does so because purging that data too early would destroy the cluster's consistency in the event of a failover.
 
In Group Replication, garbage collection for the certification_info buffer is not triggered just because a transaction commits on the primary. It is triggered by a concept called the Stable Set. 
Every node in the cluster periodically broadcasts a message to the rest of the group saying, &quot;Here are the GTIDs I have successfully applied to my disk.&quot; The cluster then calculates a global low watermark. This watermark is the highest transaction GTID that every single member of the group has successfully applied. Garbage collection is only allowed to purge write-sets from the certification database that fall below this global watermark.  To note that this purge is a synchronous operation during which writes are forbidden.
2. How the Apply Queue Stalls the Watermark
When a secondary node starts lagging, its applier queue grows. This means the secondary is receiving transactions from the network quickly, but its SQL thread is too slow to actually execute them and commit them to disk.
Because the secondary hasn't applied these transactions, it cannot report those GTIDs back to the group as &quot;finished.&quot;

The lagging secondary's local watermark stalls.
Therefore, the global low watermark for the entire cluster stalls.
Because the global watermark hasn't moved forward, the garbage_collect function on the primary (and all other nodes) says, &quot;I am not allowed to delete any write-sets yet.&quot;
As the primary continues to process new writes, the certification_info memory buffer grows continuously.

3. Why the Primary Cannot Purge Early
we might wonder: If the transaction is already committed on the primary, why does the primary care if the secondary has applied it? Why not just drop the write-set from its own memory?
The answer comes down to Failover Safety and Distributed Conflict Detection. GR is a shared-nothing, decentralized architecture. Even if you are running in Single-Primary  mode (keep this in mind will be important later), the underlying engine uses the exact same logic as Multi-Primary mode. 
Here is why the primary is forbidden from purging that data:

The Failover Scenario: Imagine our primary node crashes right now. The lagging secondary (which still has a massive apply queue) is immediately elected as the new primary.
The Conflict Risk: As the new primary, it starts accepting new writes from your application. However, it still has thousands of old transactions in its applier queue that it hasn't written to disk yet!
The Necessity of the Buffer: When a new write comes in, the new primary must check if that write conflicts with any of the pending transactions in its apply queue. It does this by checking the certification_info map. If the old primary had purged the global certification data early, the new primary wouldn't have the write-sets for those pending transactions. It would blindly accept the new write, causing a massive data conflict and breaking the replication group entirely.

Fine Marco, then what is the effect of this?
 
Well, drums roll …
… When a secondary node is elected as the new primary during a failover, it does not immediately open the floodgates to new writes. It keeps its super_read_only variable set to ON until it has completely drained its local apply queue of all transactions that were certified prior to the election.
This is an intentional design choice to guarantee that the new primary's state is completely consistent with the old primary before it starts accepting new data.
 
4. Immediate Write Rejections (No Built-in Queuing)
The most critical impact to understand is that the new primary does not queue or pause new incoming writes while it catches up. It outright rejects them.
If our application or proxy routes a COMMIT, INSERT, UPDATE, or DELETE to the new primary while it is still processing the old queue, MySQL will immediately throw an error back to the client:
ERROR 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement
5. The &quot;Brownout&quot; Window (Write Outage)
Because of this behavior, a failover in MySQL Group Replication does not instantly restore write availability. Our cluster experiences a &quot;brownout&quot;, a period where reads might succeed, but writes are entirely blocked.
The duration of this write outage is directly proportional to the size of the apply queue.

If the secondary was fully caught up, write availability is restored in milliseconds.
If the secondary was lagging by 50 minutes, your application will suffer a 50 minute write outage while the node applies the backlog.

6. Impact on Proxies (e.g., MySQL Router or ProxySQL)
If we are using a proxy layer to route your database traffic, the apply queue dictates how the proxy behaves during the transition:

MySQL Router: It continuously monitors the cluster topology and the super_read_only flag. Even though the node has technically been elected primary, Router will not open the read-write port to it until the apply queue drains and super_read_only flips to OFF. Depending on your application timeouts, client connections will either hang waiting for a writable connection or fail completely.
ProxySQL: Similar to Router, if it is configured to check for the read_only state, it will temporarily quarantine the new primary from the write hostgroup.
HAProxy (in Operator): Monitor both Primary state and read_only state, but it expose the Primary to writes causing the application to fail (bug we need to fix)  

7. Read Traffic and Stale Data
During this catch-up phase, the node will accept incoming SELECT queries (since it is still a valid database). However, because it is actively churning through the old primary's backlog, the data being read is temporarily stale.
If your application reads a row that is sitting in the apply queue but hasn't been committed to disk yet, it will get the old version of that row.
Why Flow Control is Critical
Because a large apply queue turns a seamless failover into a severe, application-breaking write outage, Group Replication includes the Flow Control feature.
Flow Control monitors the size of the apply queues across all secondaries. If a secondary starts lagging too far behind, Flow Control should actively throttle the write throughput on the current primary to allow the lagging node to catch up. It is essentially a trade-off: we accept a slight performance hit during normal operations to guarantee that your database recovers almost instantly during a failover.
However, this is not what really happens.
1. It is Reactive, Not Proactive (The Polling Blind Spot)
Flow control does not intercept and evaluate every single transaction in real-time. Instead, it relies on a periodic polling interval governed by group_replication_flow_control_period (which defaults to 1 second).
Once a second, the cluster checks the size of the apply queues and the certifier queues.

The Vulnerability: If our application generates a massive spike of 50,000 writes in 500 milliseconds, the primary will happily accept and certify all of them. Flow control will not even notice the spike until the next 1 second polling interval hits. By the time it decides to apply a throttle, the damage is already done, and the secondary's queue is already overflowing.

2. The PID Controller's &quot;Soft Brake&quot; Math
When flow control does decide to throttle, it does not simply freeze the primary. It uses a PID (Proportional-Integral-Derivative) controller algorithm to calculate a &quot;write quota&quot; (the maximum number of transactions the primary is allowed to commit in the next second).
The PID controller is deliberately tuned to be gentle. It wants to gracefully degrade performance rather than cause immediate application timeouts.

When the secondary's queue breaches the group_replication_flow_control_applier_threshold (default 25,000 transactions), the PID controller reduces the primary's quota incrementally.
The Failure Point: If the primary's incoming write rate is astronomically higher than the secondary's disk IO capacity, this incremental &quot;step down&quot; in the quota is too slow. The primary is still allowed to write, say, 10,000 transactions per second, while the secondary is only applying 2,000. The queue continues to grow aggressively despite the throttle being &quot;active.&quot;

3. The Concurrency Mismatch (Parallel vs. Serial)
This is often the silent killer that defeats flow control. Flow control makes mathematical assumptions about how fast the secondary should be able to apply transactions based on recent history.
However, the primary node might be executing writes using hundreds of highly concurrent threads. The secondary relies on the parallel applier to keep up. If the incoming workload suddenly includes transactions that cannot be parallelized, such as writes hitting overlapping rows, cascading foreign key updates, or DDL statements, the secondary's applier instantly drops from executing in parallel down to a single, serialized thread.
When this serialization happens, the secondary's applier rate plummets instantly. Flow control, which only checks in once a second and adjusts gradually, cannot brake the primary fast enough to compensate for the secondary suddenly dropping to a crawl.
What can we do?
At the moment of writing there are only two things that can be done.

Make Flow control more aggressive
Increase the number of replication appliers

 
1. Making Flow Control More Aggressive
We can configure Flow Control to be a bit more aggressive. It will still remain a suggestion but a strong one.
How it works (The Configuration):

Lower the Threshold: By reducing group_replication_flow_control_applier_threshold (default is 25,000) to something like 1,000 or 500, we force the PID controller to kick in almost immediately when a spike occurs.
Remove the Safety Net: By keeping  group_replication_flow_control_min_quota to 0 (default), we remove the minimum write guarantee. If the secondary falls behind, Flow Control is allowed to throttle the primary's writes down to zero, also if this will never happen.
Increase the Sensitivity: We can tweak the PID controller's math (using the derivative and proportional tuning variables) to react much more aggressively to queue growth.        group_replication_flow_control_hold_percent=100        group_replication_flow_control_release_percent=5

 
The reality check, does it work?:
If the expectation is to have a rigid control over the applier queue on the lagging secondary, then the answer is NO. No matter what, at the moment flow control is not designed to act as we are used to in PXC (Percona Xtradb Cluster), where we have a rigid control of the pending queue also at the cost of delaying the writes. In Group Replication  the Flow Control will never bring the write to 0, the unfortunate aspect is that the mechanism is not enough to keep the queue under control.
 
2. Increasing Replication Appliers 
To help the secondary chew through the queue faster, we can increase the number of parallel threads it uses to write to disk.
How it works: We can increase the replica_parallel_workers (formerly slave_parallel_workers) setting. GR is exceptionally smart about this. Because of the certification process we discussed earlier, GR already knows exactly which transactions modify which rows. It uses a writeset-based dependency tracker to safely hand off non-conflicting transactions to multiple worker threads simultaneously. The formula that is normally used to calculate the number of replication workers is to set 2.5 workers for each available core. IE if we have 14000m CPUs in our CR (K8) then we can assign ~35 workers, this is definitely higher than the default value of 4.   
The reality check, does it work?:  Yes, but only if our workload allows it.

The Catch - The Serialization Wall: Parallel appliers only work if the transactions do not conflict. If our application has 50 concurrent threads all trying to update the same &quot;inventory count&quot; row, or updating a highly contentious table, those transactions cannot be parallelized. The secondary's coordinator thread will see the row-level conflicts and force those transactions to wait in line and execute sequentially. We could allocate 128 parallel workers, but 127 of them will sit idle while one thread does all the work.
The Catch - Context Switching: More threads do not magically create more disk IOPS. If we set the workers too high (e.g., beyond the physical CPU core count or disk IO capacity), the secondary's InnoDB engine will spend more time context-switching and fighting over internal mutex locks than actually committing data. In many cases, over-allocating parallel workers actually slows down the apply rate.

Do we have any conclusions?
1. If HA is the goal, enforce Strict Flow Control
If our absolute top priority is High Availability, specifically achieving a near-zero Recovery Time Objective (RTO), we must configure an aggressive flow control.

The Logic: Fast failovers require small apply queues. To guarantee a small apply queue, we must strictly throttle the primary the millisecond the secondary starts to lag.
The Trade-off: we are protecting the cluster's failover readiness at the expense of application write latency. If there is a massive write spike, our application will face timeouts and connection errors, but if the primary server suddenly catches fire, our database will recover and elect a new primary almost instantly.

The problem is that Group Replication is not able to act like that today, this is something we eventually need to implement to have better HA.
2. If Performance is the goal, relax Flow Control
If our top priority is keeping the application fast and ensuring COMMIT latencies remain extremely low, we should relax flow control or rely on the generous defaults.

The Logic: By relaxing flow control, we allow the primary to run at the absolute maximum speed its local disks and CPU allow. It does not care if the secondaries fall behind. Our application users remain happy and experience zero throttling.
The Trade-off: We are accepting severe risks to your HA posture. If the primary crashes while the secondaries have a massive apply queue, we will suffer a long write outage (the brownout) while the new primary catches up. Additionally, we are accepting the risk that the certification_info memory buffer will grow significantly on the primary and eventually have the pod OOMKilled .

3. Is this not what Asynchronous replication with semy-sync offers?
 
1. The Similarities
If we look purely at how a single transaction flows and how a failover behaves, GR and Semi-Sync look like twins:

The Durability Guarantee: Semi-Sync: The primary waits to commit until at least one secondary confirms it has received the transaction and written it to its local Relay Log. 

GR: The primary waits to commit until a majority quorum of nodes confirm they have received the transaction, certified it, and written it to their local relay logs.


The Failover Delay (The Queue):  In both systems, the secondary receiving the data does not mean the secondary has applied the data to its InnoDB tables.

If a crash happens, both systems require the new primary to completely execute its pending queue (Relay Log for Semi-Sync, Apply Queue for GR) before it is safe to accept new writes.



2. The Crucial Differences
If they behave so similarly, why use GR at all?  The differences lie entirely in automation, consensus, and split-brain protection. Semi-Sync is just a data transport mechanism; GR is a full state-machine cluster.
Here is what GR gives you that Semi-Sync does not:

Automatic Election and Orchestration:

Semi-Sync: If the primary dies, Semi-Sync does nothing. The cluster sits there broken. You must rely on external tools (like Orchestrator or manual DBA intervention) to detect the crash, pick the most up-to-date secondary, wait for its relay log to apply, disable read_only, and re-point the application.
GR: The cluster detects the failure natively. The remaining nodes use Paxos consensus to elect a new primary automatically, manage the queue drain natively via the super_read_only flip we discussed, and self-heal.


Split-Brain Protection (Network Partitions):

Semi-Sync: If our network splits in half, an external failover tool might accidentally promote a secondary while the old primary is still alive and accepting writes. We now have a split-brain, and our data is permanently corrupted.
GR: GR enforces strict quorum. If a network split happens, the side of the network with the minority of nodes will automatically fence itself off and refuse all writes. Split-brain is mathematically prevented.


The Certification Database:

As we established, GR requires the certification map to ensure the new primary doesn't accept writes that conflict with its unapplied queue. Semi-Sync does not have this; it relies entirely on the external failover tool to guarantee no writes touch the new primary until the relay log is 100% applied.



3. Final observation
If we are using Single-Primary GR with relaxed flow control, we have essentially built a highly-automated, consensus-driven version of Semi-Sync replication. 
We have the exact same apply-queue bottleneck during failover, but we have traded the need for external orchestrator tools for built-in Paxos consensus and native split-brain protection.
 
Conclusions (for real)
When we run MySQL on a traditional, dedicated Virtual Machine, memory limits are &quot;soft.&quot; If the certification_info database explodes and consumes an extra 10GB of RAM because of the applier lag, the Linux OS might start aggressively swapping inactive pages to disk, but the MySQL process usually survives. Performance degrades, but the database stays online.
In Kubernetes, memory limits are &quot;hard.&quot; As we discussed earlier, Kubernetes enforces pod memory limits via cgroups v2 (memory.max). The Linux kernel's OOM Killer has no understanding of database quorum, failover states, or apply queues. It only sees math: Working Set Size &amp;gt; memory.max = Terminate Process (Exit Code 137).
The Chain Reaction of Relaxed Flow Control in k8s
If we prioritize &quot;performance&quot; by relaxing Flow Control in a Kubernetes environment, we are essentially setting a ticking time bomb. Here is the chain of events:

The Spike: Our application experiences a massive write spike.
The Queue: The secondary pod's disk cannot keep up, and its applier queue grows to 1,000,000 transactions.
The Memory Sprawl: Because the queue is large, the global low-watermark stalls. The Primary pod is forbidden from garbage collecting the certification_info map. The in-memory hash map balloons in size.
The Execution: The memory.current metric will reach the memory.max, kernel will trigger the OMMKill process. First action will be to try to free the page.cache related to the process. If the purge is successful and the memory.current is less than memory.max then the process will persist, otherwise the kernel will kill it.  We can use the WSS metric to predict a successful OMMKill.  The Primary pod's Working Set Size (WSS) breaches its Kubernetes memory limit, this is a fair estimate not an absolute value.
The Catastrophe: The Linux OOM Killer instantly assassinates the Primary MySQL process.

Because we tried to avoid a few seconds of write latency by keeping relaxed Flow Control, we inadvertently caused a hard crash of the primary database pod, with long write downtime.
The Architectural Law
Therefore, here is my statement as architectural law for containerized environments: In Kubernetes, High Availability and Pod stability are so intrinsically linked that Flow Control must act as hard as it can to cap the apply queue.

We cannot allow unbounded memory growth in a container. The only way to bound certification_info memory is to bound the apply queue.
The only way to bound the apply queue is with strict, aggressive Flow Control.
Increasing the number of replication appliers helps but is not the conclusive answer.

In a Kubernetes environment, we must tune group_replication_flow_control_applier_threshold to a strict, low number, and accept that during massive traffic spikes, our application will experience write throttling. It is infinitely better for our application's connection pool to wait 2 seconds for a COMMIT to succeed than for the primary database pod to be violently OOMKilled by the kernel, and have to wait for minutes or hours to recover write capabilities.
Note
Just as a mention this is exactly how Percona Operator with Percona Xtradb Cluster works. To be more specific, PXC and in general solutions based on Galera have a Flow Control mechanism that enforces the queue to be inside hard limits. While this more invasive control may be noticeable at application level, it guarantees that the other nodes are not lagging behind the primary and this is why it is a stronger HA solution in the Kubernetes environment.
 
Reference
https://github.com/Tusamarco/mysqloperatorcalculator
Managing Resources and OOMKills: Resource Management for Pods and Containers
(This page details how memory limits are enforced reactively by the Linux kernel via OOM kills).
How WSS triggers Evictions: Node-pressure Eviction
(This page explicitly details how the kubelet uses the memory.available signal, which is derived from node capacity minus the working set size).
Latest changes. Pointer to the code
Swap Memory Management (Core Concepts &amp;amp; Configuration): https://kubernetes.io/docs/concepts/cluster-administration/swap-memory-management/</description>
    <content:encoded><![CDATA[<p><span>It is time to talk again about Flow control and group replication. This time with a special eye on the use of Group Replication in the Kubernetes context. In this article we will dig a bit on how it works and what are the various side effects. </span></p>
<p> <a href="http://www.tusacentral.com/joomla/images/stories/pxc_vs_gr_stale_reads/flow_control_front_image.jpg"><img src="http://www.tusacentral.com/joomla/images/stories/pxc_vs_gr_stale_reads/flow_control_front_image_small.jpeg" alt="flow control front image" width="700" height="382"></a></p>
<h2><span>The problem</span></h2>
<p><span>Recently I was refining the calculation I use in the </span><a href="https://github.com/Tusamarco/mysqloperatorcalculator"><span>MySQL calculator for Operator</span></a><span> given I was constantly encountering a very serious problem with the Percona Server Operator.</span></p>
<p><span>The problem is that when the deployment was/is serving a high level of traffic, it will, no matter what, end up in getting OMMKill by the K8 system. </span></p>
<p><span>This because the pod was gradually consuming more and more memory, reaching the memory limit set in the CR specification. </span></p>
<p> </p>
<p><span>Now let me clarify a few things, to get straight to the facts.</span></p>
<p><span>Kubernetes itself does not OOMKill a pod for hitting its memory limit, the mechanism works as described below with mention on how Working Set Size (WSS) is calculated, and how OOMKills are triggered, and in the resource sections, the links to the official documentation and source code.</span></p>
<p> </p>
<h3><span>1. The Reality of OOMKills vs. Kubelet Evictions</span></h3>
<p><span>It is crucial to distinguish between what the Linux kernel does and what Kubernetes does:</span></p>
<ul>
<li aria-level="1"><b>OOMKilled (Exit Code 137):</b><span> This is executed entirely by the </span><b>Linux kernel's OOM Killer</b><span>, not Kubernetes. When we set a memory limit in our Pod spec, Kubernetes translates that into a Linux cgroup constraint (</span><span>memory.limit_in_bytes</span><span> for cgroups v1, or </span><span>memory.max</span><span> for cgroups v2). If our container attempts to allocate more memory than this hard limit, and the kernel cannot reclaim any page cache (like inactive files), the kernel directly intervenes and terminates the process.</span></li>
<li aria-level="1"><b>Node-Pressure Evictions:</b><span> This is where Kubernetes actively observes memory. The </span><span>kubelet</span><span> monitors the </span><span>working_set_bytes</span><span> metric to protect the </span><i><span>node</span></i><span> from running out of memory. If the node's memory drops below an eviction threshold, Kubernetes will actively evict pods to prevent the kernel from initiating a system-wide OOM kill.</span></li>
</ul>
<h3><span>2. How Working Set Size (WSS) is Calculated for the container</span></h3>
<p><span>Kubernetes monitors container memory via </span><b>cAdvisor</b><span>, which is integrated directly into the </span><span>kubelet</span><span>. cAdvisor calculates the Working Set Size by taking the total memory usage and subtracting the inactive file cache (memory that the kernel can easily reclaim if it faces memory pressure).</span></p>
<p><span>Because active file caches and anonymous memory (like our application's heap) cannot be easily evicted, this working set metric is the most accurate representation of the memory your container is forcing the system to hold.</span></p>
<p> </p>
<p><span>The Calculation &amp; cgroups Evolution The core mathematical calculation is </span><i><span>Memory Usage</span></i><i><span> - </span></i><i><span>Inactive File Cache</span></i><span>, but </span><i><span>how</span></i><span> cAdvisor fetches this data from the Linux kernel depends entirely on your node's cgroup version. Modern cAdvisor relies heavily on the </span><span>opencontainers/runc/libcontainer</span><span> library to read these raw cgroup files:</span></p>
<ul>
<li aria-level="1"><span>cgroups v1: cAdvisor starts with the raw usage from </span><span>memory.usage_in_bytes</span><span> and subtracts the reclaimable cache found under the </span><span>total_inactive_file</span><span> key.</span></li>
<li aria-level="1"><span>cgroups v2 (Unified): cAdvisor starts with the raw usage from </span><span>memory.current</span><span> and subtracts the reclaimable cache found under the </span><span>inactive_file</span><span> key.</span></li>
</ul>
<p> </p>
<p><span>The Underlying Code Logic While older versions used a static </span><span>setMemoryStats</span><span> function, modern Kubernetes branches handle this dynamically. The logic executes the following flow before reporting back to the </span><span>kubelet</span><span>:</span></p>
<ol>
<li aria-level="1"><span>Detects Version: It identifies whether the node runs cgroups v1 or v2 to determine the correct inactive file key name.</span></li>
<li aria-level="1"><span>Fetch Usage: It pulls the raw memory usage from the container.</span></li>
<li aria-level="1"><span>Subtract Cache: It looks up the inactive file value and safely subtracts it from the usage (including a safeguard to ensure the working set never drops below zero).</span></li>
<li aria-level="1"><span>Report Metric: It sets this final calculated value as </span><span>container_memory_working_set_bytes</span><span>, which the </span><span>kubelet</span><span> then uses to decide if the node is under memory pressure.</span></li>
</ol>
<h2><span>Back to us </span></h2>
<p><span>At the end the point is that if our pod reaches the limit and we ARE NOT using the new </span><a href="https://docs.google.com/document/d/1WSoJxaAPMP4tdiT_-U4YwgoHBI8zKJ0Hw-y6iUXJQBc/edit#bookmark=id.59zqn83hsx1p"><span>swap feature</span></a><span> existing in Kubernetes, our pod will be brutally killed, and in 99% of the cases our production will suffer a lot. !Ops spoiler!</span></p>
<p> </p>
<p><span>To clearly understand what was causing the issue about this memory consumption and having my calculator fail, I started to collect the information about the memory usage in MySQL itself.</span></p>
<p> </p>
<p><span>SELECT EVENT_NAME,CURRENT_NUMBER_OF_BYTES_USED / 1024 / 1024 AS current_usage_mb FROM performance_schema.memory_summary_global_by_event_name WHERE EVENT_NAME like 'memory/%' and EVENT_NAME not like 'memory/performance%'  order by current_usage_mb desc limit 25;</span></p>
<p><span>Which will give you and output like this:</span></p>
<pre class="lang:default decode:true">+---------------------------------------+------------------+
| EVENT_NAME                            | current_usage_mb |
+---------------------------------------+------------------+
| memory/innodb/buf_buf_pool            |   46398.92578125 |
| memory/group_rpl/GCS_XCom::xcom_cache |    1066.66179943 |
| memory/group_rpl/certification_info   |      92.45250702 |
| memory/innodb/log_buffer_memory       |      64.00096130 |
| memory/sql/TABLE                      |      49.90627003 |
| memory/innodb/memory                  |      34.68734741 |
| memory/innodb/ut0link_buf             |      24.00006104 |
| memory/innodb/lock0lock               |      21.40064240 |
| memory/mysqld_openssl/openssl_malloc  |       9.51009655 |
| memory/innodb/read0read               |       8.19496155 |
| memory/mysys/KEY_CACHE                |       8.00215149 |
| memory/innodb/sync0arr                |       7.03147125 |
| memory/innodb/ha_innodb               |       6.87006950 |
| memory/innodb/lock_sys                |       5.25009155 |
| memory/sql/log_sink_pfs               |       5.00003052 |
| memory/innodb/ut0pool                 |       4.00017548 |
| memory/sql/dd::objects                |       2.83031464 |
| memory/innodb/std                     |       2.72618866 |
| memory/innodb/os0file                 |       2.63054657 |
| memory/innodb/os0event                |       2.34302521 |
| memory/sql/TABLE_SHARE::mem_root      |       2.31734467 |
| memory/innodb/trx0trx                 |       2.22647858 |
| memory/temptable/physical_ram         |       1.00003052 |
| memory/sql/dd::String_type            |       0.94942093 |
| memory/innodb/btr0pcur                |       0.89743423 |
+---------------------------------------+------------------+
</pre>
<p> </p>
<p><span>Plus I used PMM to collect memory information </span></p>
<p><a href="http://www.tusacentral.com/joomla/images/stories/pxc_vs_gr_stale_reads/allocation_with_incidents_describe.jpg"><img src="http://www.tusacentral.com/joomla/images/stories/pxc_vs_gr_stale_reads/allocation_with_incidents_describe.jpg" alt="allocation with incidents describe" width="600" height="308" class="alignnone wp-image-49857 size-medium_large"></a></p>
<p><span>To simulate the load I used the sysbench-tpcc (tpc-c derivate test) variant and run the tests simulating a load of 1024 threads against a cluster based on machine with 16 Core and 64Gb volumes ~3k IOPS, so not gigantic but not small. </span></p>
<p> </p>
<p><span>The finding was almost immediate:</span></p>
<pre class="lang:default decode:true">+---------------------------------------+------------------+
| EVENT_NAME                            | current_usage_mb |
+---------------------------------------+------------------+
| memory/innodb/buf_buf_pool            |   46398.92578125 |
| memory/group_rpl/certification_info   |    1431.67934418 | &lt;constantly increasing
| memory/group_rpl/GCS_XCom::xcom_cache |    1066.63542366 |
| memory/sql/Gtid_set::Interval_chunk   |      95.52413940 |
| memory/innodb/log_buffer_memory       |      64.00096130 |
| memory/sql/TABLE                      |      48.17613125 |
| memory/innodb/memory                  |      35.08897400 |
| memory/innodb/ut0link_buf             |      24.00006104 |
| memory/innodb/lock0lock               |      21.40064240 |
| memory/innodb/read0read               |      14.86782837 |
| memory/mysqld_openssl/openssl_malloc  |      12.05916119 |
| memory/mysys/KEY_CACHE                |       8.00215149 |
| memory/innodb/sync0arr                |       7.03147125 |
| memory/innodb/ha_innodb               |       6.84074974 |
| memory/innodb/lock_sys                |       5.25009155 |
| memory/sql/log_sink_pfs               |       5.00003052 |
| memory/innodb/ut0pool                 |       4.00017548 |
| memory/sql/dd::objects                |       2.82012177 |
| memory/innodb/std                     |       2.72515869 |
| memory/innodb/os0file                 |       2.63054657 |
| memory/innodb/os0event                |       2.35884857 |
| memory/innodb/trx0trx                 |       2.22647858 |
| memory/sql/TABLE_SHARE::mem_root      |       1.83777618 |
| memory/innodb/trx0undo                |       1.26304626 |
| memory/mysys/lf_node                  |       1.08828735 |
+---------------------------------------+------------------+
</pre>
<p> </p>
<p><span> </span><span>Ok then … What is the certification info???</span></p>
<h2><span>What is group_rpl/certification_info?</span></h2>
<p><span>In MySQL, </span><span>memory/group_rpl/certification_info</span><span> is a Performance Schema memory instrument. It tracks the exact amount of RAM allocated to store the Certification Database (or Certification Info).</span></p>
<p><span>In Group Replication, nodes do not lock rows across the network while a transaction is executing. Instead, transactions execute locally and optimistically. When it is time to commit, the transaction undergoes a </span><i><span>Certification Process</span></i><span> to ensure no other concurrent transaction in the cluster has modified the exact same rows. The </span><span>certification_info</span><span> buffer is the in-memory hash map that makes this conflict detection possible.</span></p>
<h3><span>1. What is it used for?</span></h3>
<p><span>The </span><span>certification_info</span><span> structure acts as a tracking ledger for recently modified rows.</span></p>
<p><span>Here is how it works under the hood:</span></p>
<ul>
<li aria-level="1"><span>The Key-Value Pair: It is fundamentally an in-memory dictionary. The </span><i><span>key</span></i><span> is the hash of a modified row (extracted from the transaction's "write set"), and the </span><i><span>value</span></i><span> is the Global Transaction Identifier (GTID) of the transaction that successfully modified it.</span></li>
<li aria-level="1"><span>Conflict Detection: When a new transaction attempts to commit, it broadcasts its write set and the "snapshot version" of the database it saw when it started. The certifier cross-references the incoming transaction's write set against the </span><span>certification_info</span><span> map.</span></li>
<li aria-level="1"><span>The Decision: If the </span><span>certification_info</span><span> shows that a row was modified by a newer GTID that the incoming transaction did not "see" when it started, a conflict is flagged, and the transaction is aborted. If no conflict exists, the transaction is certified, and the </span><span>certification_info</span><span> map is updated with the new write set and GTID.</span></li>
</ul>
<p><span>The primary does not hold onto this memory out of stubbornness; it does so because purging that data too early would destroy the cluster's consistency in the event of a failover.</span></p>
<p> </p>
<p><span>In Group Replication, garbage collection for the </span><span>certification_info</span><span> buffer is not triggered just because a transaction commits on the primary. It is triggered by a concept called the Stable Set. </span></p>
<p><span>Every node in the cluster periodically broadcasts a message to the rest of the group saying, </span><i><span>"Here are the GTIDs I have successfully applied to my disk."</span></i><span> The cluster then calculates a </span><i><span>global low watermark</span></i><span>. This watermark is the highest transaction GTID that </span><i><span>every single member</span></i><span> of the group has successfully applied. Garbage collection is only allowed to purge write-sets from the certification database that fall </span><i><span>below</span></i><span> this global watermark. </span><span> </span><span>To note that this purge is a synchronous operation during which writes are forbidden.</span></p>
<h3><span>2. How the Apply Queue Stalls the Watermark</span></h3>
<p><span>When a secondary node starts lagging, its </span><i><span>applier queue</span></i><span> grows. This means the secondary is receiving transactions from the network quickly, but its SQL thread is too slow to actually execute them and commit them to disk.</span></p>
<p><span>Because the secondary hasn't applied these transactions, it cannot report those GTIDs back to the group as "finished."</span></p>
<ul>
<li aria-level="1"><span>The lagging secondary's local watermark stalls.</span></li>
<li aria-level="1"><span>Therefore, the </span><i><span>global low watermark</span></i><span> for the entire cluster stalls.</span></li>
<li aria-level="1"><span>Because the global watermark hasn't moved forward, the </span><span>garbage_collect</span><span> function on the primary (and all other nodes) says, </span><i><span>"I am not allowed to delete any write-sets yet."</span></i></li>
<li aria-level="1"><span>As the primary continues to process new writes, the </span><span>certification_info</span><span> memory buffer grows continuously.</span></li>
</ul>
<h3><span>3. Why the Primary Cannot Purge Early</span></h3>
<p><span>we might wonder: </span><i><span>If the transaction is already committed on the primary, why does the primary care if the secondary has applied it? Why not just drop the write-set from its own memory?</span></i></p>
<p><span>The answer comes down to </span><i><span>Failover Safety</span></i><span> and </span><i><span>Distributed Conflict Detection</span></i><span>. GR is a shared-nothing, decentralized architecture. Even if you are running in Single-Primary  mode (keep this in mind will be important later), the underlying engine uses the exact same logic as Multi-Primary mode. </span></p>
<p><span>Here is why the primary is forbidden from purging that data:</span></p>
<ul>
<li aria-level="1"><span>The Failover Scenario: Imagine our primary node crashes right now. The lagging secondary (which still has a massive apply queue) is immediately elected as the new primary.</span></li>
<li aria-level="1"><span>The Conflict Risk: As the new primary, it starts accepting new writes from your application. However, it still has thousands of old transactions in its applier queue that it hasn't written to disk yet!</span></li>
<li aria-level="1"><span>The Necessity of the Buffer: When a new write comes in, the new primary </span><i><span>must</span></i><span> check if that write conflicts with any of the pending transactions in its apply queue. It does this by checking the </span><span>certification_info</span><span> map. If the old primary had purged the global certification data early, the new primary wouldn't have the write-sets for those pending transactions. It would blindly accept the new write, causing a massive data conflict and breaking the replication group entirely.</span></li>
</ul>
<p><span>Fine Marco, then what is the effect of this?</span></p>
<p> </p>
<p><span>Well, drums roll …</span></p>
<p><span>… When a secondary node is elected as the new primary during a failover, it does not immediately open the floodgates to new writes. </span><b>It keeps its </b><b><i>super_read_only</i></b><b> variable set to ON until it has completely drained its local apply queue of all transactions that were certified prior to the election.</b></p>
<p><span>This is an intentional design choice to guarantee that the new primary's state is completely consistent with the old primary before it starts accepting new data.</span></p>
<p> </p>
<h3><span>4. Immediate Write Rejections (No Built-in Queuing)</span></h3>
<p><span>The most critical impact to understand is that the new primary does not queue or pause new incoming writes while it catches up. It outright rejects them.</span></p>
<p><span>If our application or proxy routes a COMMIT, INSERT, UPDATE, or DELETE to the new primary while it is still processing the old queue, MySQL will immediately throw an error back to the client:</span></p>
<p><span>ERROR 1290 (HY000): The MySQL server is running with the --super-read-only option so it cannot execute this statement</span></p>
<h3><span>5. The "Brownout" Window (Write Outage)</span></h3>
<p><span>Because of this behavior, a failover in MySQL Group Replication does not instantly restore write availability. Our cluster experiences a "brownout", a period where reads might succeed, but writes are entirely blocked.</span></p>
<p><span>The duration of this write outage is directly proportional to the size of the apply queue.</span></p>
<ul>
<li aria-level="1"><span>If the secondary was fully caught up, write availability is restored in milliseconds.</span></li>
<li aria-level="1"><span>If the secondary was lagging by 50 minutes, your application will suffer a 50 minute write outage while the node applies the backlog.</span></li>
</ul>
<h3><span>6. Impact on Proxies (e.g., MySQL Router or ProxySQL)</span></h3>
<p><span>If we are using a proxy layer to route your database traffic, the apply queue dictates how the proxy behaves during the transition:</span></p>
<ul>
<li aria-level="1"><span>MySQL Router: It continuously monitors the cluster topology and the super_read_only flag. Even though the node has technically been elected primary, Router will not open the read-write port to it until the apply queue drains and super_read_only flips to OFF. Depending on your application timeouts, client connections will either hang waiting for a writable connection or fail completely.</span></li>
<li aria-level="1"><span>ProxySQL: Similar to Router, if it is configured to check for the read_only state, it will temporarily quarantine the new primary from the write hostgroup.</span></li>
<li aria-level="1"><span>HAProxy (in Operator): Monitor both Primary state and read_only state, but it expose the Primary to writes causing the application to fail (bug we need to fix)  </span></li>
</ul>
<h3><span>7. Read Traffic and Stale Data</span></h3>
<p><span>During this catch-up phase, the node will accept incoming </span><span>SELECT</span><span> queries (since it is still a valid database). However, because it is actively churning through the old primary's backlog, the data being read is temporarily stale.</span></p>
<p><span>If your application reads a row that is sitting in the apply queue but hasn't been committed to disk yet, it will get the old version of that row.</span></p>
<h2><span>Why Flow Control is Critical</span></h2>
<p><span>Because a large apply queue turns a seamless failover into a severe, application-breaking write outage, Group Replication includes the Flow Control feature.</span></p>
<p><span>Flow Control monitors the size of the apply queues across all secondaries. If a secondary starts lagging too far behind, Flow Control should actively throttle the write throughput on the </span><i><span>current</span></i><span> primary to allow the lagging node to catch up. It is essentially a trade-off: we accept a slight performance hit during normal operations to guarantee that your database recovers almost instantly during a failover.</span></p>
<p><b>However, this is not what really happens</b><span>.</span></p>
<h3><span>1. It is Reactive, Not Proactive (The Polling Blind Spot)</span></h3>
<p><span>Flow control does not intercept and evaluate every single transaction in real-time. Instead, it relies on a periodic polling interval governed by </span><span>group_replication_flow_control_period</span><span> (which defaults to 1 second).</span></p>
<p><span>Once a second, the cluster checks the size of the apply queues and the certifier queues.</span></p>
<ul>
<li aria-level="1"><span>The Vulnerability: If our application generates a massive spike of 50,000 writes in 500 milliseconds, the primary will happily accept and certify all of them. Flow control will not even notice the spike until the next 1 second polling interval hits. By the time it decides to apply a throttle, the damage is already done, and the secondary's queue is already overflowing.</span></li>
</ul>
<h3><span>2. The PID Controller's "Soft Brake" Math</span></h3>
<p><span>When flow control does decide to throttle, it does not simply freeze the primary. It uses a PID (Proportional-Integral-Derivative) controller algorithm to calculate a "write quota" (the maximum number of transactions the primary is allowed to commit in the next second).</span></p>
<p><span>The PID controller is deliberately tuned to be gentle. It wants to gracefully degrade performance rather than cause immediate application timeouts.</span></p>
<ul>
<li aria-level="1"><span>When the secondary's queue breaches the </span><span>group_replication_flow_control_applier_threshold</span><span> (default 25,000 transactions), the PID controller reduces the primary's quota incrementally.</span></li>
<li aria-level="1"><span>The Failure Point: If the primary's incoming write rate is astronomically higher than the secondary's disk IO capacity, this incremental "step down" in the quota is too slow. The primary is still allowed to write, say, 10,000 transactions per second, while the secondary is only applying 2,000. The queue continues to grow aggressively despite the throttle being "active."</span></li>
</ul>
<h3><span>3. The Concurrency Mismatch (Parallel vs. Serial)</span></h3>
<p><span>This is often the silent killer that defeats flow control. Flow control makes mathematical assumptions about how fast the secondary </span><i><span>should</span></i><span> be able to apply transactions based on recent history.</span></p>
<p><span>However, the primary node might be executing writes using hundreds of highly concurrent threads. The secondary relies on the parallel applier to keep up. If the incoming workload suddenly includes transactions that cannot be parallelized, such as writes hitting overlapping rows, cascading foreign key updates, or DDL statements, the secondary's applier instantly drops from executing in parallel down to a single, serialized thread.</span></p>
<p><span>When this serialization happens, the secondary's applier rate plummets instantly. Flow control, which only checks in once a second and adjusts gradually, cannot brake the primary fast enough to compensate for the secondary suddenly dropping to a crawl.</span></p>
<h2><span>What can we do?</span></h2>
<p><span>At the moment of writing there are only two things that can be done.</span></p>
<ol>
<li aria-level="1"><span>Make Flow control more aggressive</span></li>
<li aria-level="1"><span>Increase the number of replication appliers</span></li>
</ol>
<p> </p>
<h3><span>1. Making Flow Control More Aggressive</span></h3>
<p><span>We can configure Flow Control to be a bit more aggressive. It will still remain a </span><i><span>suggestion</span></i><span> but a strong one.</span></p>
<p><span>How it works (The Configuration):</span></p>
<ul>
<li aria-level="1"><span>Lower the Threshold: By reducing </span><span>group_replication_flow_control_applier_threshold</span><span> (default is 25,000) to something like 1,000 or 500, we force the PID controller to kick in almost immediately when a spike occurs.</span></li>
<li aria-level="1"><span>Remove the Safety Net: By keeping  </span><span>group_replication_flow_control_min_quota</span><span> to </span><span>0 </span><span>(default), we remove the minimum write guarantee. If the secondary falls behind, Flow Control is allowed to throttle the primary's writes down to zero, also if this will never happen.</span></li>
<li aria-level="1"><span>Increase the Sensitivity: We can tweak the PID controller's math (using the derivative and proportional tuning variables) to react much more aggressively to queue growth.</span><span> </span><span>       group_replication_flow_control_hold_percent=100</span><span> </span><span>       group_replication_flow_control_release_percent=5</span></li>
</ul>
<p> </p>
<p><b>The reality check, does it work?:</b></p>
<p><span>If the expectation is to have a rigid control over the applier queue on the lagging secondary, then the answer is </span><b>NO</b><span>. No matter what, at the moment flow control is not designed to act as we are used to in PXC (Percona Xtradb Cluster), where we have a rigid control of the pending queue also at the cost of delaying the writes. In Group Replication  the Flow Control will never bring the write to 0, the unfortunate aspect is that the mechanism is not enough to keep the queue under control.</span></p>
<p> </p>
<h3><span>2. Increasing Replication Appliers </span></h3>
<p><span>To help the secondary chew through the queue faster, we can increase the number of parallel threads it uses to write to disk.</span></p>
<p><b>How it works</b><span>: We can increase the </span><span>replica_parallel_workers</span><span> (formerly </span><span>slave_parallel_workers</span><span>) setting. GR is exceptionally smart about this. Because of the certification process we discussed earlier, GR already knows exactly which transactions modify which rows. It uses a writeset-based dependency tracker to safely hand off non-conflicting transactions to multiple worker threads simultaneously.</span><span> </span><span>The formula that is normally used to calculate the number of replication workers is to set 2.5 workers for each available core. IE if we have 14000m CPUs in our CR (K8) then we can assign ~35 workers, this is definitely higher than the default value of 4.   </span></p>
<p><b>The reality check, does it work?</b><span>:  </span><b>Yes</b><span>, but only if our workload allows it.</span></p>
<ul>
<li aria-level="1"><span>The Catch - The Serialization Wall: Parallel appliers only work if the transactions do not conflict. If our application has 50 concurrent threads all trying to update the same "inventory count" row, or updating a highly contentious table, those transactions </span><i><span>cannot</span></i><span> be parallelized. The secondary's coordinator thread will see the row-level conflicts and force those transactions to wait in line and execute sequentially. We could allocate 128 parallel workers, but 127 of them will sit idle while one thread does all the work.</span></li>
<li aria-level="1"><span>The Catch - Context Switching: More threads do not magically create more disk IOPS. If we set the workers too high (e.g., beyond the physical CPU core count or disk IO capacity), the secondary's InnoDB engine will spend more time context-switching and fighting over internal mutex locks than actually committing data. In many cases, over-allocating parallel workers actually </span><i><span>slows down</span></i><span> the apply rate.</span></li>
</ul>
<h2><span>Do we have any conclusions?</span></h2>
<h3><span>1. If HA is the goal, enforce Strict Flow Control</span></h3>
<p><span>If our absolute top priority is High Availability, specifically achieving a near-zero Recovery Time Objective (RTO), we must configure an aggressive flow control.</span></p>
<ul>
<li aria-level="1"><span>The Logic: Fast failovers require small apply queues. To guarantee a small apply queue, we must strictly throttle the primary the millisecond the secondary starts to lag.</span></li>
<li aria-level="1"><span>The Trade-off: we are protecting the cluster's failover readiness at the expense of application write latency. If there is a massive write spike, our application will face timeouts and connection errors, but if the primary server suddenly catches fire, our database will recover and elect a new primary almost instantly.</span></li>
</ul>
<p><span>The problem is that Group Replication is not able to act like that today, this is something we eventually need to implement to have better HA.</span></p>
<h3><span>2. If Performance is the goal, relax Flow Control</span></h3>
<p><span>If our top priority is keeping the application fast and ensuring </span><span>COMMIT</span><span> latencies remain extremely low, we should relax flow control or rely on the generous defaults.</span></p>
<ul>
<li aria-level="1"><span>The Logic: By relaxing flow control, we allow the primary to run at the absolute maximum speed its local disks and CPU allow. It does not care if the secondaries fall behind. Our application users remain happy and experience zero throttling.</span></li>
<li aria-level="1"><span>The Trade-off: We are accepting severe risks to your HA posture. If the primary crashes while the secondaries have a massive apply queue, we will suffer a long write outage (the brownout) while the new primary catches up. Additionally, we are accepting the risk that the </span><span>certification_info</span><span> memory buffer will grow significantly on the primary and eventually have the pod OOMKilled .</span></li>
</ul>
<h3><span>3. Is this not what Asynchronous replication with semy-sync offers?</span></h3>
<p> </p>
<h4><i><span>1. The Similarities</span></i></h4>
<p><span>If we look purely at how a single transaction flows and how a failover behaves, GR and Semi-Sync look like twins:</span></p>
<ul>
<li aria-level="1"><span>The Durability Guarantee: </span><i><span>Semi-Sync:</span></i><span> The primary waits to commit until at least one secondary confirms it has received the transaction and written it to its local Relay Log. </span>
<ul>
<li aria-level="2"><i><span>GR:</span></i><span> The primary waits to commit until a majority quorum of nodes confirm they have received the transaction, certified it, and written it to their local relay logs.</span></li>
</ul>
</li>
<li aria-level="1"><span>The Failover Delay (The Queue):  In both systems, the secondary receiving the data does not mean the secondary has applied the data to its InnoDB tables.</span>
<ul>
<li aria-level="2"><span>If a crash happens, both systems require the new primary to completely execute its pending queue (Relay Log for Semi-Sync, Apply Queue for GR) before it is safe to accept new writes.</span></li>
</ul>
</li>
</ul>
<h4><i><span>2. The Crucial Differences</span></i></h4>
<p><span>If they behave so similarly, why use GR at all? </span><span> </span><span>The differences lie entirely in automation, consensus, and split-brain protection. Semi-Sync is just a data transport mechanism; GR is a full state-machine cluster.</span></p>
<p><span>Here is what GR gives you that Semi-Sync does not:</span></p>
<ul>
<li aria-level="1"><span>Automatic Election and Orchestration:</span>
<ul>
<li aria-level="2"><i><span>Semi-Sync:</span></i><span> If the primary dies, Semi-Sync does nothing. The cluster sits there broken. You must rely on external tools (like Orchestrator or manual DBA intervention) to detect the crash, pick the most up-to-date secondary, wait for its relay log to apply, disable </span><span>read_only</span><span>, and re-point the application.</span></li>
<li aria-level="2"><i><span>GR:</span></i><span> The cluster detects the failure natively. The remaining nodes use Paxos consensus to elect a new primary automatically, manage the queue drain natively via the </span><span>super_read_only</span><span> flip we discussed, and self-heal.</span></li>
</ul>
</li>
<li aria-level="1"><span>Split-Brain Protection (Network Partitions):</span>
<ul>
<li aria-level="2"><i><span>Semi-Sync:</span></i><span> If our network splits in half, an external failover tool might accidentally promote a secondary while the old primary is still alive and accepting writes. We now have a split-brain, and our data is permanently corrupted.</span></li>
<li aria-level="2"><i><span>GR:</span></i><span> GR enforces strict quorum. If a network split happens, the side of the network with the minority of nodes will automatically fence itself off and refuse all writes. Split-brain is mathematically prevented.</span></li>
</ul>
</li>
<li aria-level="1"><span>The Certification Database:</span>
<ul>
<li aria-level="2"><span>As we established, GR requires the certification map to ensure the new primary doesn't accept writes that conflict with its unapplied queue. Semi-Sync does not have this; it relies entirely on the external failover tool to guarantee no writes touch the new primary until the relay log is 100% applied.</span></li>
</ul>
</li>
</ul>
<h4><i><span>3. Final observation</span></i></h4>
<p><span>If we are using Single-Primary GR with relaxed flow control, we have essentially built a highly-automated, consensus-driven version of Semi-Sync replication. </span></p>
<p><span>We have the exact same apply-queue bottleneck during failover, but we have traded the need for external orchestrator tools for built-in Paxos consensus and native split-brain protection.</span></p>
<p> </p>
<h2><span>Conclusions (for real)</span></h2>
<p><span>When we run MySQL on a traditional, dedicated Virtual Machine, memory limits are "soft." If the </span><span>certification_info</span><span> database explodes and consumes an extra 10GB of RAM because of the applier lag, the Linux OS might start aggressively swapping inactive pages to disk, but the MySQL process usually survives. Performance degrades, but the database stays online.</span></p>
<p><span>In Kubernetes, memory limits are "hard." As we discussed earlier, Kubernetes enforces pod memory limits via cgroups v2 (</span><span>memory.max</span><span>). The Linux kernel's OOM Killer has no understanding of database quorum, failover states, or apply queues. It only sees math: </span><i><span>Working Set Size &gt; </span></i><i><span>memory.max</span></i><i><span> = Terminate Process (Exit Code 137).</span></i></p>
<h3><span>The Chain Reaction of Relaxed Flow Control in k8s</span></h3>
<p><span>If we prioritize "performance" by relaxing Flow Control in a Kubernetes environment, we are essentially setting a ticking time bomb. Here is the chain of events:</span></p>
<ol>
<li aria-level="1"><b>The Spike</b><span>: Our application experiences a massive write spike.</span></li>
<li aria-level="1"><b>The Queue</b><span>: The secondary pod's disk cannot keep up, and its applier queue grows to 1,000,000 transactions.</span></li>
<li aria-level="1"><b>The Memory Sprawl</b><span>: Because the queue is large, the global low-watermark stalls. The Primary pod is forbidden from garbage collecting the </span><span>certification_info</span><span> map. The in-memory hash map balloons in size.</span></li>
<li aria-level="1"><b>The Execution</b><span>: The </span><i><span>memory.current</span></i><span> metric will reach the </span><i><span>memory.max</span></i><span>, kernel will trigger the OMMKill process. First action will be to try to free the page.cache related to the process. If the purge is successful and the memory.current is less than </span><i><span>memory.max</span></i><span> then the process will persist, otherwise the kernel will kill it. </span><span> </span><span>We can use the WSS metric to predict a successful OMMKill.</span><span> </span><span> The Primary pod's Working Set Size (WSS) breaches its Kubernetes memory limit, this is a fair estimate not an absolute value.</span></li>
<li aria-level="1"><b>The Catastrophe</b><span>: The Linux OOM Killer instantly assassinates the Primary MySQL process.</span></li>
</ol>
<p><span>Because we tried to avoid a few seconds of write latency by keeping relaxed Flow Control, we inadvertently caused a hard crash of the primary database pod, with long write downtime.</span></p>
<h3><span>The Architectural Law</span></h3>
<p><span>Therefore, here is my statement as architectural law for containerized environments: </span><b>In Kubernetes, High Availability and Pod stability are so intrinsically linked that Flow Control </b><b><i>must</i></b><b> act as hard as it can to cap the apply queue.</b></p>
<ul>
<li aria-level="1"><span>We cannot allow unbounded memory growth in a container. The only way to bound </span><span>certification_info</span><span> memory is to bound the apply queue.</span></li>
<li aria-level="1"><span>The only way to bound the apply queue is with strict, aggressive Flow Control.</span></li>
<li aria-level="1"><span>Increasing the number of replication appliers helps but is not the conclusive answer.</span></li>
</ul>
<p><span>In a Kubernetes environment, we must tune </span><span>group_replication_flow_control_applier_threshold</span><span> to a strict, low number, and accept that during massive traffic spikes, our application </span><i><span>will</span></i><span> experience write throttling. It is infinitely better for our application's connection pool to wait 2 seconds for a </span><span>COMMIT</span><span> to succeed than for the primary database pod to be violently OOMKilled by the kernel, and have to wait for minutes or hours to recover write capabilities.</span></p>
<h3><span>Note</span></h3>
<p><span>Just as a mention this is exactly how Percona Operator with Percona Xtradb Cluster works. To be more specific, PXC and in general solutions based on Galera have a Flow Control mechanism that enforces the queue to be inside hard limits. While this more invasive control may be noticeable at application level, it guarantees that the other nodes are not lagging behind the primary and this is why it is a stronger HA solution in the Kubernetes environment.</span></p>
<p> </p>
<h2><span>Reference</span></h2>
<p><a href="https://github.com/Tusamarco/mysqloperatorcalculator"><span>https://github.com/Tusamarco/mysqloperatorcalculator</span></a></p>
<p><span>Managing Resources and OOMKills: </span><a href="https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/"><span>Resource Management for Pods and Containers</span></a></p>
<p><i><span>(This page details how memory limits are enforced reactively by the Linux kernel via OOM kills).</span></i></p>
<p><span>How WSS triggers Evictions: </span><a href="https://kubernetes.io/docs/concepts/scheduling-eviction/node-pressure-eviction/"><span>Node-pressure Eviction</span></a></p>
<p><i><span>(This page explicitly details how the </span></i><i><span>kubelet</span></i><i><span> uses the </span></i><i><span>memory.available</span></i><i><span> signal, which is derived from node capacity minus the working set size).</span></i></p>
<p><span>Latest changes. </span><a href="https://github.com/google/cadvisor/blob/195858077459e69455fd9621fcbaeaf377d69d0e/container/libcontainer/handler.go#L865"><span>Pointer to the code</span></a><span></span></p>
<p><span>Swap Memory Management (Core Concepts &amp; Configuration): </span><a href="https://kubernetes.io/docs/concepts/cluster-administration/swap-memory-management/"><span>https://kubernetes.io/docs/concepts/cluster-administration/swap-memory-management/</span></a></p>]]></content:encoded>
    <pubDate>Fri, 12 Jun 2026 08:12:19 +0000</pubDate>
    <dc:creator>Marco Tusa</dc:creator>
    <category>MySQL</category>
  </item>

  <item>
    <title>MySQL Contributor Summit 2026: Collaboration, Innovation, and Community-Driven Development </title>
    <guid isPermaLink="false">f03fd69d3ace4188a00f6e1fafe76d0f</guid>
    <link>https://blogs.oracle.com/mysql/mysql-contributor-summit-2026-collaboration-innovation-and-community-driven-development</link>
    <description>On May 26, The MySQL Community Team at Oracle welcomed MySQL contributors, customers, partners, and community members to the MySQL Contributor Summit at the Oracle Redwood Shores campus, with additional participants joining remotely.  The Contributor Summit brought Oracle engineers and community contributors together to exchange ideas, share ongoing work, and explore opportunities to collaborate on […]</description>
    <pubDate>Thu, 11 Jun 2026 00:34:16 +0000</pubDate>
    <dc:creator>Oracle MySQL Group</dc:creator>
    <category>MySQL</category>
    <category>MySQL Community</category>
    <category>mysql</category>
    <category>mysqlcommunity</category>
  </item>

  <item>
    <title>Opening Up the MySQL Bug Process</title>
    <guid isPermaLink="false">ae90517cb2445073c3c3d689b6500890</guid>
    <link>https://blogs.oracle.com/mysql/opening-up-the-mysql-bug-process</link>
    <description>The MySQL team has been working hard to foster innovation, strengthen collaboration with our community, support meaningful contributions, and grow the broader MySQL ecosystem through greater openness and transparency. We believe MySQL is at its best when everyone can see how progress is being made, where work is happening, and how issues move through the […]</description>
    <pubDate>Wed, 10 Jun 2026 17:59:39 +0000</pubDate>
    <dc:creator>Oracle MySQL Group</dc:creator>
    <category>MySQL</category>
    <category>MySQL Community</category>
    <category>mysql</category>
    <category>MySQL Community Engagement</category>
    <category>mysqlcommunity</category>
  </item>

  <item>
    <title>Speaking at COSCUP 2026 — Planning your upgrade to MySQL 9.7</title>
    <guid isPermaLink="false">https://ronaldbradford.com/blog/2026-06-10-coscup-2026-mysql-97-upgrade/</guid>
    <link>https://ronaldbradford.com/blog/2026-06-10-coscup-2026-mysql-97-upgrade/</link>
    <description>I am excited to be speaking at COSCUP 2026 in Taipei, Taiwan on August 8th and 9th. COSCUP (Conference for Open Source Coders, Users and Promoters) is one of the largest open source conferences in Asia, and it is always a privilege to present to the engaged and technically sharp community there.</description>
    <pubDate>Wed, 10 Jun 2026 00:00:00 +0000</pubDate>
    <dc:creator>Ronald Bradford</dc:creator>
  </item>

  <item>
    <title>MySQL 9.7 LTS Is Here: Upgrade and Modernize on a Stronger Community Edition</title>
    <guid isPermaLink="false">b54e1d2809b01bfc46b878332b13b72b</guid>
    <link>https://blogs.oracle.com/mysql/mysql-9-7-lts-is-here-upgrade-and-modernize-on-a-stronger-community-edition</link>
    <description>MySQL 9.7 LTS is here, establishing the new MySQL 9.7.x Long-Term Support release line. For organizations running MySQL today, this is the right time to evaluate upgrade plans and move toward a current, supported foundation. It is also a good moment for teams standardizing their database strategy to take a fresh look at MySQL. Whether […]</description>
    <pubDate>Tue, 09 Jun 2026 21:18:09 +0000</pubDate>
    <dc:creator>Oracle MySQL Group</dc:creator>
    <category>MySQL</category>
    <category>MySQL Community</category>
    <category>Database Modernization</category>
    <category>MySQL 9.7 LTS</category>
    <category>MySQL Community Edition</category>
    <category>MySQL Entperprise Edition</category>
    <category>MySQL Upgrade</category>
  </item>

  <item>
    <title>No More JSON Plumbing: MySQL 9.7 Community Levels Up Duality Views</title>
    <guid isPermaLink="false">59ed499c65f316472287840b5cdb352e</guid>
    <link>https://blogs.oracle.com/mysql/no-more-json-plumbing</link>
    <description>Modern applications often pass JSON back and forth with the database server. With MySQL, we have had great JSON support, but working with relational data as JSON usually meant generating documents manually with built-in JSON functions. When an application sent JSON back to the server, we often had to break that document apart and write […]</description>
    <pubDate>Tue, 09 Jun 2026 14:57:09 +0000</pubDate>
    <dc:creator>Oracle MySQL Group</dc:creator>
    <category>MySQL</category>
    <category>MySQL Community</category>
    <category>mysql</category>
    <category>mysqlcommunity</category>
  </item>

  <item>
    <title>Inserting in Two Tables in a Single Round-Trip with JSON Duality Views in MySQL 9.7</title>
    <guid isPermaLink="false">tag:blogger.com,1999:blog-9188714267863327820.post-2871020334408500883</guid>
    <link>https://jfg-mysql.blogspot.com/2026/06/inserting-in-two-tables-in-single-round-trip.html</link>
    <description>A few months ago, I was asking myself how to insert in two tables in a single round-trip to the database.&amp;amp;nbsp; I wanted to do that to optimize a process.&amp;amp;nbsp; My optimization involved splitting a table in two, which would need inserting in two tables atomically.&amp;amp;nbsp; The downside was changing an auto-commit INSERT to a transaction with two inserts, which was changing the shape of the workload</description>
    <content:encoded><![CDATA[A few months ago, I was asking myself how to insert in two tables in a single round-trip to the database.&amp;nbsp; I wanted to do that to optimize a process.&amp;nbsp; My optimization involved splitting a table in two, which would need inserting in two tables atomically.&amp;nbsp; The downside was changing an auto-commit INSERT to a transaction with two inserts, which was changing the shape of the workload]]></content:encoded>
    <pubDate>Thu, 04 Jun 2026 14:17:32 +0000</pubDate>
    <dc:creator>Jean-François Gagné</dc:creator>
    <category>MySQL 9.7</category>
    <category>optimization</category>
  </item>

  <item>
    <title>Summary of MySQL Public Discussion #4: Updates and Improvements to Contributions – Let’s Talk About What’s Next for MySQL</title>
    <guid isPermaLink="false">1e09c7d263b09f10d68ea89ea4746d28</guid>
    <link>https://blogs.oracle.com/mysql/summary-of-mysql-public-discussion-4</link>
    <description>One of the best things about MySQL has always been its community. Whether you’re building applications, running production databases, contributing code, creating tools, writing documentation, answering questions, or simply sharing feedback, you’ve helped make MySQL what it is today. In this discussion we shared updates on where we are today and had a discussion on […]</description>
    <pubDate>Thu, 04 Jun 2026 13:57:00 +0000</pubDate>
    <dc:creator>Oracle MySQL Group</dc:creator>
    <category>MySQL</category>
    <category>MySQL Community</category>
    <category>mysql</category>
    <category>mysqlcommunity</category>
  </item>

  <item>
    <title>From Question to Insight with MySQL Studio</title>
    <guid isPermaLink="false">c758572603bd3e73805bc62763aaf193</guid>
    <link>https://blogs.oracle.com/mysql/from-question-to-insight-with-mysql-studio</link>
    <description>When we introduced MySQL Studio, the goal was to bring the common parts of database development and analysis into one OCI workspace: SQL authoring, schema exploration, results visualization, and Ask Studio. The next step is making that workspace more useful during the everyday flow of MySQL work. For many MySQL developers, DBAs, and application teams, […]</description>
    <pubDate>Wed, 03 Jun 2026 08:05:56 +0000</pubDate>
    <dc:creator>Oracle MySQL Group</dc:creator>
    <category>MySQL</category>
    <category>MySQL Basics/How-To</category>
    <category>MySQL HeatWave</category>
  </item>

  <item>
    <title>Where can you find MySQL during June–August 2026?</title>
    <guid isPermaLink="false">f57572050796d83f15d8e5d40a1cb6f7</guid>
    <link>https://blogs.oracle.com/mysql/where-can-you-find-mysql-during-june-august-2026</link>
    <description>The MySQL Community team will be active across conferences, user group meetups, open source events, and regional community activities during the summer months. Whether you would like to hear about the latest MySQL 9.7 updates, meet the MySQL team, join a user group meetup, or connect with the broader open source and developer community, here […]</description>
    <pubDate>Mon, 01 Jun 2026 14:22:55 +0000</pubDate>
    <dc:creator>Oracle MySQL Group</dc:creator>
    <category>MySQL</category>
    <category>MySQL Community</category>
  </item>

  <item>
    <title>MySQL HeatWave Document Store for Modern Applications</title>
    <guid isPermaLink="false">daa087deec7f5e37ed8d8ff01f05caba</guid>
    <link>https://blogs.oracle.com/mysql/mysql-heatwave-document-store-for-modern-applications</link>
    <description>Modern applications generate and consume enormous amounts of semi-structured data. User profiles, product catalogs, IoT telemetry, application events, AI prompts, and content metadata rarely fit neatly into rigid relational schemas anymore. At the same time, enterprises still need the reliability, security, analytics, and operational maturity of a traditional database platform. This is where MySQL HeatWave […]</description>
    <pubDate>Fri, 29 May 2026 02:22:23 +0000</pubDate>
    <dc:creator>Oracle MySQL Group</dc:creator>
    <category>MySQL Enterprise</category>
    <category>MySQL HeatWave</category>
  </item>

  <item>
    <title>Top 4 MySQL ADO.NET Providers for 2026 </title>
    <guid isPermaLink="false">https://www.devart.com/blog/?p=180382</guid>
    <link>https://www.devart.com/blog/top-4-mysql-ado-net-providers-for-2026.html</link>
    <description>Every MySQL ADO.NET provider looks reliable during development. The real differences appear later, when async performance breaks under load, licensing becomes a problem, or integrations with EF Core and cloud environments start creating friction. Read the full article to find the right MySQL .NET provider for your stack.
The post Top 4 MySQL ADO.NET Providers for 2026  appeared first on Devart Blog.</description>
    <content:encoded><![CDATA[<p>Every MySQL ADO.NET provider looks reliable during development. The real differences appear later, when async performance breaks under load, licensing becomes a problem, or integrations with EF Core and cloud environments start creating friction. Read the full article to find the right MySQL .NET provider for your stack.</p>
<p>The post <a href="https://www.devart.com/blog/top-4-mysql-ado-net-providers-for-2026.html">Top 4 MySQL ADO.NET Providers for 2026 </a> appeared first on <a href="https://www.devart.com/blog">Devart Blog</a>.</p>]]></content:encoded>
    <pubDate>Tue, 26 May 2026 13:15:50 +0000</pubDate>
    <dc:creator>Alena Subotina</dc:creator>
    <category>ADO.NET Data Providers</category>
    <category>ADO.NET</category>
  </item>

</channel>
</rss>
