Planet MySQL

A first look at MySQL 26.7 Early Access

Ronald Bradford — Thu, 23 Jul 2026 00:00:00 +0000

MySQL has dropped its newest release, categorized as “Early Access” and available at https://labs.mysql.com/. While this post is not going to go into depth, I wanted to at least validate the management changes you would normally verify between MySQL upgrades.

MySQL 9.7 Community Edition: Smarter Join Planning with the Hypergraph Optimizer

Oracle MySQL Group — Wed, 22 Jul 2026 06:00:00 +0000

With the release of MySQL 9.7 Community Edition, the Hypergraph Optimizer is now available to everyone. This is a significant addition to MySQL and one that has generated a lot of excitement in the MySQL community. The promise is simple: better execution plans for complex queries, especially those with many joins. Like most new features, […]

OCI Cache and MySQL HeatWave: Better Together for High-Performance Applications

Olivier Dasini — Tue, 21 Jul 2026 12:14:36 +0000

Modern applications are expected to deliver instant responses while processing increasingly large volumes of data. Achieving this level of performance isn’t simply a matter of making the database faster. It requires placing the right workload on the right layer of the architecture. Some operations require ultra-fast repeated reads, others demand transactional consistency, while analytical queries benefit […]

The post OCI Cache and MySQL HeatWave: Better Together for High-Performance Applications first appeared on Data Daz (dasini.net) - Data Systems, AI, and Real-World Insights.

MySQL on OKE: Database Operations as Kubernetes State

Oracle MySQL Group — Mon, 20 Jul 2026 14:26:46 +0000

MySQL is one of the databases developers trust most when an application needs a proven, familiar, open source relational engine. Kubernetes has become the orchestration layer teams rely on to run and scale modern workloads. Put them together, and the question gets interesting: how do you run MySQL with the same declarative, repeatable operating model […]

From Tokyo to Seoul to Taipei: MySQL Community Conversations Across JAPAC

Oracle MySQL Group — Fri, 17 Jul 2026 17:08:56 +0000

Over the past year, we have taken important steps to increase transparency and engagement across the MySQL ecosystem. Through public roadmap discussions, Early Access releases, publication of worklogs, bug transparency and backlog reduction, community public discussions, increased use of GitHub discussions, and contributor events, we have created more opportunities for the community to understand what […]

Optimizing Replication Lag for Large Transactions and DDL in MySQL

Libing Song — Fri, 17 Jul 2026 09:30:00 +0000

Since MySQL 5.6, the MySQL replication team has been working to reduce replication lag. The first step was schema-level parallel application of the binlog, but schema-level parallelism only helps when writes are spread across many databases; in the common case, where most write traffic hits a single database, it provides almost no parallelism. MySQL 5.7 then introduced the Commit-Order parallel-replay strategy, which depends on how many transactions run concurrently on the primary: the replica can replay quickly only when the primary is highly concurrent. When concurrency on the primary is low, the replica still replays slowly and lag builds up. To fix that, MySQL 5.7 also introduced the Writeset (row-level) strategy, which lets the replica replay in parallel quickly no matter how concurrent the primary is.

We rolled out the writeset-based strategy across our fleet long ago, and it eliminated roughly 60% of our replication-lag problems. More than another 30% come from large transactions and DDL, the hardest replication-lag problem to solve in MySQL. Last year we built a mechanism in AliSQL called Binlog Realtime Replication (BRR) that solves it completely.

How Binlog Realtime Replication Works

The figure above shows why large transactions and DDL cause replication lag. Binlog replication works at the granularity of a transaction: a transaction’s events are written to the binlog file only after it commits, then shipped to the replica and executed there (a DDL can be treated as a single transaction). The change becomes visible to applications only once the replica finishes executing. If a transaction takes a long time on the primary, it takes just as long on the replica, and the lag equals the replica’s execution time. In practice the lag is often worse. First, a large transaction produces very large binlog events, which adds transmission delay. Second, while a large transaction (especially a DDL) is running, it can block the replay of other transactions, so the relay log piles up; once the large transaction or DDL finishes replaying, that backlog also needs time to drain before the replica catches up.

The idea behind the optimization is simple: have the replica start executing the large transaction or DDL at the same time as the primary, and once the primary commits, tell the replica to commit too. With this mechanism, replication lag for large transactions and DDL stays under one second. The chart below compares the lag from a large transaction before and after the optimization: with realtime replication, large transactions no longer cause lag, and neither do DDLs.

The feature has been enabled by default in our RDS service since 2025. To date more than 3,000 instances have used it, running realtime replication about 300,000 times for large transactions and about 60,000 times for DDL.

Implementing Realtime Replication

The core idea of realtime replication fits in one sentence: as soon as the primary starts executing, it ships the binlog events (or DDL) to the replica, which executes them in lockstep; when the primary finally commits or rolls back, the replica does the same.

Realtime replication has two parts: realtime transmission and realtime application. Realtime transmission streams the binlog events a large transaction produces on the primary to the replica as they are generated; that part is covered in Binlog Transmission Optimization for Large MySQL Transactions. Realtime application replays those events on the replica as they arrive, using a dedicated group of replay threads, as shown below:

While a transaction runs on the primary, the binlog events it produces are first buffered in the Binlog Cache. If the transaction is large (the Binlog Cache exceeds a threshold), the primary’s Dump thread reads the Binlog Cache temporary file and sends the events straight to the replica. The replica writes them into a dedicated Brr Cache (not the relay log file), where a new group of Brr Worker threads applies them in real time.

For DDL, binlog events are produced later than for a large transaction — a DDL writes its Query_log_event into the Binlog Cache only during the commit phase. BRR therefore handles DDL specially: once the primary starts executing the DDL, it builds the Query_log_event directly and puts it in an in-memory buffer, ddl_query_buffer; the Dump thread reads events from this buffer and sends them to the replica, where a Brr Worker again executes the DDL in real time.

As a result, replica execution of DDL and large transactions shifts from “run only after the primary finishes” to “run on the primary and replica in parallel”, leaving only network transmission and commit as the residual lag, typically on the order of tens of milliseconds.

Below we look at how BRR is implemented, from both the primary and the replica side.

Overall BRR Architecture

Primary Side

When a large transaction or DDL needs realtime replication, a Brr_trx is created and registered with the Brr_trx_manager.

Brr_binlog_sender is an extension of the Dump thread; it reads events from a Brr_trx and pushes them to the replica. Originally the Dump thread did just one thing: read events from the binlog file and send them to the replica. BRR gives it one more job — poll each active Brr_trx, read binlog events from its Binlog Cache temporary file or from ddl_query_buffer, and send them to the replica.

Realtime transmission reuses the existing Dump channel. To tell BRR traffic apart from ordinary traffic, BRR borrows an idea from Semisync and attaches an extra BRR Header to each event; the header identifies whether an event is BRR or ordinary replication traffic. And to keep BRR events from choking the ordinary binlog-event channel, BRR applies flow control.

Replica Side

Using the BRR Header, the replica’s IO thread splits events into two kinds: BRR events go into the Brr_cache, while normal events take the original path into the relay log.

Brr_cache is the replica-side storage for a BRR transaction; each BRR transaction has one Brr_cache. When the IO thread receives a BRR event, it uses the brr_index in the header to locate the matching Brr_cache (if it’s the first event, it creates a new Brr_cache and wakes a Brr Worker), writes the event into the Brr_cache temporary file, and updates the readable position.
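
The routing just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not AliSQL's code: the `Event`, `BrrCache`, and `ReplicaIoThread` classes are hypothetical, a missing BRR Header is modeled as `brr_index=None`, and in-memory byte buffers stand in for the Brr_cache temporary file and the relay log.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Event:
    payload: bytes
    # Hypothetical stand-in for the BRR Header: None means "ordinary event".
    brr_index: Optional[int] = None


@dataclass
class BrrCache:
    data: bytearray = field(default_factory=bytearray)  # stands in for the temp file
    readable_position: int = 0


class ReplicaIoThread:
    def __init__(self):
        self.relay_log = bytearray()   # ordinary path
        self.brr_caches = {}           # brr_index -> BrrCache
        self.woken_workers = 0         # counts "wake a Brr Worker" calls

    def receive(self, event):
        if event.brr_index is None:
            # Ordinary event: original path into the relay log.
            self.relay_log += event.payload
            return
        cache = self.brr_caches.get(event.brr_index)
        if cache is None:
            # First event of this BRR transaction: create its cache, wake a worker.
            cache = self.brr_caches[event.brr_index] = BrrCache()
            self.woken_workers += 1
        cache.data += event.payload
        cache.readable_position = len(cache.data)  # workers may read up to here
```

Feeding the sketch one ordinary event and two events with the same `brr_index` leaves the relay log holding only the ordinary payload, creates a single cache, and wakes a worker exactly once.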

Brr_rpl_info manages these BRR transactions.

The Brr Worker threads are dedicated to applying BRR transactions. When idle, a Brr Worker picks a Brr_cache that hasn’t started being applied and makes itself its owner. From then on it is bound to that Brr_cache, looping to read and replay binlog events until it sees a Gtid_log_event (the primary has committed) or receives a BRR_ROLLBACK_EVENT (the primary rolled back).

The gtid_executed Snapshot

The uncommitted BRR transactions from the primary run in parallel on the replica alongside already-committed transactions. If a BRR transaction depends on an already-committed one, its binlog events must not start until that dependency has finished replaying on the replica. Otherwise you get escalating failures: a deadlock, then a broken replication channel, and in the worst case data inconsistency between primary and replica. Take this example:

INSERT INTO t1(pk, c2) VALUES(pk1, 1);
UPDATE t1 SET c2 = 2;  -- large transaction

The UPDATE is the large transaction, and it must not begin until the INSERT has finished replaying. If the UPDATE runs first, it fails when updating the pk1 row because that row doesn’t exist yet.

BRR uses a gtid_executed snapshot to enforce these ordering dependencies. When a DDL or large transaction starts on the primary, the primary’s current gtid_executed captures every preceding transaction it saw. Once the replica’s gtid_executed has caught up to that value (that is, is a superset of it), all the transactions this one depends on have been replayed on the replica, and it is safe to start applying it.

To do this, BRR adds a new event type, Brr_gtid_executed_log_event, whose body holds a gtid_executed set. At specific moments the primary takes a gtid_executed snapshot and writes it to the BRR channel; when a replica Brr Worker reads the snapshot, it waits for all the GTIDs in it to finish before continuing.
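
The superset test at the heart of this wait can be illustrated with a small Python sketch. It is a simplified model rather than the BRR implementation: `parse_gtid_set` and `snapshot_satisfied` are hypothetical helper names, and GTID intervals are expanded into plain sets of integers, which is only reasonable for small examples. MySQL itself exposes the same comparison through the built-in `GTID_SUBSET()` function.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-3:5,uuid2:7' into {uuid: set of transaction numbers}."""
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        numbers = result.setdefault(uuid, set())
        for interval in intervals:
            start, _, end = interval.partition("-")
            # A bare number such as '5' is the one-element interval 5-5.
            numbers.update(range(int(start), int(end or start) + 1))
    return result


def snapshot_satisfied(snapshot, replica_executed):
    """True once the replica's gtid_executed is a superset of the snapshot."""
    needed = parse_gtid_set(snapshot)
    done = parse_gtid_set(replica_executed)
    return all(nums <= done.get(uuid, set()) for uuid, nums in needed.items())
```

In this model a Brr Worker that reads a snapshot would keep waiting while `snapshot_satisfied(snapshot, current_gtid_executed)` is false and resume applying events once it turns true.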

Realtime Replication of Large Transactions

Creating and Updating a Brr_trx

When a transaction runs on the primary, its binlog events go first into the Binlog Cache (an in-memory buffer backed by a temporary file). In MySQL, once the Binlog Cache fills its in-memory buffer, it spills to the temporary file.

BRR hooks in here: after each batch of events is written to the Binlog Cache, it checks the temporary file’s size. Once the file exceeds a certain size, BRR creates a Brr_trx, records the temporary file name and the current readable position, and registers it with Brr_trx_manager. From then on, every append to the Binlog Cache updates the Brr_trx’s end_position and wakes the Dump thread to send those events to the replica.
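
A minimal Python sketch of this hook, under stated assumptions: the class and function names are hypothetical, the threshold value is invented because the article does not give one, and a single `size` counter stands in for the temporary file's size.

```python
from dataclasses import dataclass
from typing import Optional

LARGE_TRX_THRESHOLD = 64 * 1024 * 1024  # hypothetical; the article gives no value


@dataclass
class BrrTrx:
    tmp_file: str
    end_position: int  # how far the Dump thread may read


@dataclass
class BinlogCache:
    tmp_file: str
    size: int = 0  # simplification: treated as the temporary file's size
    brr_trx: Optional[BrrTrx] = None


class BrrTrxManager:
    def __init__(self):
        self.active = []
        self.wakeups = 0  # counts "wake the Dump thread" calls

    def register(self, trx):
        self.active.append(trx)

    def wake_dump_thread(self):
        self.wakeups += 1


def on_cache_append(cache, nbytes, manager, threshold=LARGE_TRX_THRESHOLD):
    """Run after each batch of events is appended to the Binlog Cache."""
    cache.size += nbytes
    if cache.brr_trx is None:
        if cache.size > threshold:
            # The transaction just became "large": create and register a Brr_trx.
            cache.brr_trx = BrrTrx(cache.tmp_file, cache.size)
            manager.register(cache.brr_trx)
        return
    # Already registered: extend the readable range and wake the Dump thread.
    cache.brr_trx.end_position = cache.size
    manager.wake_dump_thread()
```

Small transactions never cross the threshold, so in this model they never create a `BrrTrx` and pay only for one integer comparison per append.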

Transmitting Binlog Events

Before sending each batch of binlog events, the Dump thread emits a Brr_gtid_executed_log_event as that batch’s dependency snapshot, then sends the batch itself.

Committing the Transaction

For a large transaction, the binlog events sit in the Brr_cache temporary file — not yet relay log — until the Brr Worker reads the Gtid_log_event. When the primary finally commits, it sends the Gtid_log_event over the BRR channel, and the IO thread does two things:

Renames the Brr_cache temporary file into a relay log file. Based on the GTID, the primary’s Dump thread then skips sending this transaction, so its events aren’t shipped again as an ordinary transaction.
Notifies the Brr Worker to read the Gtid_log_event and Xid_log_event and complete the commit.

Rolling Back the Transaction

The rollback path is straightforward: when the primary rolls back, it sends a BRR_ROLLBACK_EVENT over the BRR channel; on receiving it, the replica’s IO thread sends a KILL_QUERY signal to the corresponding Brr Worker. The Brr Worker detects KILL_QUERY, rolls back the current transaction, cleans up, and moves on to the next Brr_cache.

Note that after being killed, a Brr Worker neither exits nor propagates the error to the SQL thread — unlike an ordinary Worker, which must halt all replication on an error. The reason: for an ordinary Worker the transaction has already committed on the primary, so if the replica gives up, the two diverge. A Brr Worker’s transaction, by contrast, runs concurrently with the primary, so a primary rollback is the normal path and the replica must roll back as well.

Realtime Application of DDL

Creating a Brr_trx

For large transactions, we decide whether a transaction is “large” by the total size of its binlog events in the Binlog Cache. DDL is trickier: some DDLs only touch metadata and finish almost instantly, while for DDLs that touch data the run time depends on how much data is involved and is hard to estimate accurately. So instead of predicting a DDL’s run time up front, we decide whether to realtime-replicate it by whether its execution exceeds a timeout.

Every DDL creates a Brr_trx, but that Brr_trx isn’t sent to the replica right away. A DDL’s Brr_trx has a threshold — 1000 ms by default — and only when the DDL’s run time exceeds it does the Dump thread start sending the Brr_trx. If a DDL finishes quickly, within one second, its Brr_trx is silently discarded and the DDL ships to the replica over the ordinary binlog channel, exactly as if BRR were off.
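
The decision reduces to a comparison against the threshold, which the following Python sketch makes explicit. The function name and the three return labels are illustrative assumptions; only the 1000 ms default comes from the text above.

```python
DDL_BRR_THRESHOLD_MS = 1000  # default stated in the article


def ddl_replication_path(elapsed_ms, finished, threshold_ms=DDL_BRR_THRESHOLD_MS):
    """Classify a DDL's Brr_trx at a given moment.

    'brr'      - run time exceeded the threshold: the Dump thread sends the Brr_trx
    'ordinary' - DDL finished within the threshold: the Brr_trx is discarded and
                 only the ordinary binlog channel carries the DDL
    'pending'  - still running and under the threshold: nothing is sent yet
    """
    if elapsed_ms > threshold_ms:
        return "brr"
    if finished:
        return "ordinary"
    return "pending"
```

A metadata-only DDL that completes in 5 ms is classified as `"ordinary"`, while an index build still running after 1.5 seconds is classified as `"brr"`.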

A DDL’s Brr_trx is created during the DDL’s Prepare phase — that is, after the DDL has acquired the MDL X lock — because only with the X lock does the DDL have permission to operate on the table. Any conflicting operations have either already committed or must wait until the DDL releases the X lock or finishes.

Two gtid_executed Snapshots

An Online DDL runs in three phases: Prepare, Execute, and Commit. After Prepare, the MDL X lock is downgraded to an S lock, so during Execute, DML and DDL can run in parallel. During Commit, the S lock is upgraded back to an X lock; regaining the X lock means all those parallel DMLs have already committed. The replica must honor the same rule: those committed DMLs have to finish replaying before the replica can enter the Commit phase.

So realtime replication of an Online DDL has two points on the replica that must be synchronized: one before entering Prepare, and one before entering Commit. Correspondingly, the primary takes two gtid_executed snapshots — one after the DDL enters Prepare, and one after it enters Commit.

Shipping Binlog Events Twice

In the large-transaction section we saw that a large transaction is transmitted to the replica via BRR, and the copy in the binlog file is not shipped again. DDL is different: it ships twice — once over BRR, and again as the binlog events in the binlog file.

A DDL’s Query_log_event is tiny, so shipping it twice costs almost nothing. Shipping it only once would force us into the large-transaction rename logic (renaming the Brr_cache temporary file into relay log), with all its edge cases. For DDL, simply shipping it twice and discarding the Brr_cache afterward is the simplest approach.

As for ordering, the Dump thread guarantees BRR events ship before ordinary events. That way the Brr Worker is sure to get the DDL first and start executing it; by the time the ordinary events reach the relay log, the Brr Worker is already applying the DDL.

When an ordinary Worker reads the DDL from the relay log, it checks whether the GTID is in owned_gtids. If it is (a Brr Worker is executing it), the ordinary Worker waits; once the Brr Worker commits, the GTID is added into gtid_executed. The ordinary Worker wakes and finds the GTID already in gtid_executed, so it skips the whole DDL.

If the Brr Worker rolled the DDL back, the GTID is removed from owned_gtids and never added to gtid_executed. The ordinary Worker then wakes and sees the transaction wasn’t executed. It runs the DDL normally — the fallback path, equivalent to running with BRR off.
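
The three outcomes for the ordinary Worker can be summarized in a short Python sketch. It models `owned_gtids` and `gtid_executed` as plain sets of GTID strings and uses a hypothetical function name; the real server structures are more involved.

```python
def ordinary_worker_action(gtid, owned_gtids, gtid_executed):
    """Decide what an ordinary Worker does with a DDL read from the relay log."""
    if gtid in gtid_executed:
        return "skip"      # a Brr Worker already committed it
    if gtid in owned_gtids:
        return "wait"      # a Brr Worker is executing it right now
    return "execute"       # never run, or rolled back by the Brr Worker: fallback path


# Walk through the commit case described above.
owned, executed = {"uuid:42"}, set()
state_while_running = ordinary_worker_action("uuid:42", owned, executed)

owned.discard("uuid:42")   # Brr Worker commits: the GTID leaves owned_gtids...
executed.add("uuid:42")    # ...and is added to gtid_executed
state_after_commit = ordinary_worker_action("uuid:42", owned, executed)
```

In the rollback case the GTID is removed from `owned` without being added to `executed`, so the same function returns `"execute"` and the DDL runs as if BRR were off.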

Conclusion

AliSQL’s Binlog Realtime Replication tackles the thorniest lag in MySQL binlog replication — lag from large transactions and DDL — by executing on the primary and replica in parallel. On top of that, we’ve made optimizations for the writeset mechanism, for massively concurrent workloads, and for the medium-sized transactions that batch jobs produce. Together, these have eliminated 95% of the replication lag in our production environment.

MySQL Major Version Upgrade Checklist – how to

Kedar Vaijanapurkar — Thu, 16 Jul 2026 12:00:00 +0000

This article provides a MySQL major version upgrade checklist, along with a video, that one may follow to ease the upgrade task.

The post MySQL Major Version Upgrade Checklist – how to first appeared on Change Is Inevitable.

Missed the May 2026 MySQL Contributor Summit? Watch Every Session On Demand

Oracle MySQL Group — Thu, 16 Jul 2026 06:00:00 +0000

The inaugural MySQL Contributor Summit, held in May 2026, brought together Oracle engineers, customers, partners, and members of the open source community for a full day of technical collaboration focused on the future of MySQL. The Summit featured more than 20 sessions covering topics including AI integration, performance, observability, replication, developer experience, extensibility, and community […]

Binlog Transmission Optimization for Large MySQL Transactions

Libing Song — Thu, 16 Jul 2026 02:00:00 +0000

Large transactions are a notorious problem in MySQL: they cause not only replication lag but also stability problems. A previous article, MySQL Large Transaction Commit Optimization, covered the problems a large transaction causes at commit time and the optimizations we made in AliSQL. This article looks at the problems a large transaction causes during semi-synchronous replication, and how AliSQL solves them.

In MySQL Large Transaction Commit Optimization we noted that writing the binlog when a large transaction commits can produce strange slow queries like these:

An INSERT that normally runs in an instant took 1.3s, yet the slow-query log shows no long lock wait.
Every statement in a multi-statement transaction had already finished, yet the COMMIT alone took 1.3s.

Besides writing the binlog at commit, transmitting a large transaction’s binlog during semi-synchronous replication produces the same symptom. Below is a simulated test: we used sysbench oltp_write_only to simulate a normal write workload, then committed, in the background, a transaction that generated 2 GB of binlog events (with the large-transaction commit optimization already applied). When the large transaction commits, writes drop to zero and don’t recover until semisync times out.

Root Cause

The figure above shows the commit flow of a transaction under semi-synchronous replication:

On commit, the transaction runs two-phase commit, starting with Prepare.
It then writes its binlog events to the binlog file.
After writing the binlog, it waits for its binlog events to be sent to the replica (after_sync mode).
The binlog Dump thread then sends the transaction’s binlog events to the replica.
The replica’s IO thread receives these events and writes them into the relay log file.
Once it has the complete transaction, the IO thread sends the primary an acknowledgment saying it has all of the transaction’s binlog events. The ack is expressed as a binlog file name and offset. In the figure, Trx_n’s binlog end offset is 530, so the replica’s IO thread sends master-bin.000001:530 to the primary, meaning every transaction before master-bin.000001:530 has been received.
On the primary, the Semisync Ack Receiver thread receives the ack and, based on the offset, wakes the corresponding transaction.
Once woken, the transaction finishes committing and returns OK to the user.

There is only one Dump thread between the primary and the replica, and it transmits binlog events in the order they were written to the binlog. The replica’s IO thread likewise writes received events into the relay log in that same order before acknowledging the primary. So a later transaction can’t be sent until the earlier one has finished. If the current transaction has a huge number of binlog events, sending them takes a very long time, and a later transaction — however small — has to wait. That wait includes not just its own transmission time but the large transaction’s ahead of it. Hence the slow-log symptom: a small transaction suddenly becomes very slow.
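
A back-of-the-envelope calculation shows the size of this effect. The Python sketch below is a deliberately crude model: the function name is hypothetical, the 1 Gbit/s link speed is an assumed figure, and it ignores network latency, relay-log write time, and the ack round trip.

```python
def commit_wait_seconds(queued_bytes, own_bytes, bandwidth_bytes_per_s):
    """Time until a transaction's last byte has been sent on a single ordered stream.

    queued_bytes: binlog bytes written ahead of it that are still unsent
    own_bytes:    the transaction's own binlog bytes
    """
    return (queued_bytes + own_bytes) / bandwidth_bytes_per_s


GIB = 1024 ** 3
LINK = 125_000_000  # assumed 1 Gbit/s link, in bytes per second

# A 1 KB transaction committing right behind a 2 GB transaction.
behind_large = commit_wait_seconds(2 * GIB, 1024, LINK)

# The same 1 KB transaction with nothing ahead of it.
alone = commit_wait_seconds(0, 1024, LINK)
```

Under these assumptions `behind_large` is about 17 seconds while `alone` is a few microseconds, which is the slow-log symptom in miniature: the small transaction's own work is negligible and nearly all of its wait belongs to the transaction ahead of it.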

To cope with this, MySQL provides the rpl_semi_sync_master_timeout parameter, which sets how long a transaction waits for an ack; once the wait exceeds rpl_semi_sync_master_timeout, replication automatically falls back to asynchronous. We can set this to a small value to avoid the severe case where a large transaction makes the whole instance unwritable.

An RPO = 0 Design Based on Semi-Synchronous Replication

Because a transaction under semi-synchronous replication can’t commit until its binlog has been replicated to a replica, it’s natural to think of using semisync to build an RPO = 0 (zero data loss) consistency solution.

This architecture needs two replicas, and semisync guarantees that a transaction commits only after it receives an ack from at least one of them.

If the primary crashes, the data has been replicated to at least one replica.
If one replica becomes unavailable, cluster availability is unaffected.

To guarantee RPO = 0, semisync must never fall back to async. MySQL semisync has two points where it can degrade to async:

After a crash and restart, transactions already written to the binlog are committed automatically, even though they may not yet have been replicated to a replica.
Once the wait reaches rpl_semi_sync_master_timeout, it degrades to async.

The former can’t be controlled from outside — it requires changing MySQL’s code. The latter requires setting rpl_semi_sync_master_timeout to a very large value so semisync never degrades. Large transactions are clearly the thorniest issue in an RPO = 0 design: the moment one appears, it makes the whole cluster unwritable, so the design must take countermeasures. A DBA with strong influence over the application can arrange for it to avoid large transactions; but at a large company, with sprawling and complex applications, eliminating them entirely is hard, and an RDS provider has no control over its users at all. In practice, availability usually matters far more than consistency, so many designs adopt a temporary-degradation strategy, falling back to async whenever a large transaction appears.

Realtime Transmission of Large Transactions

In AliSQL we designed a realtime-transmission mechanism to solve the problems large-transaction transmission causes; with it, there is no need to degrade semisync to async.

The realtime large-transaction transmission mechanism reads a transaction’s binlog events out of the Binlog Cache temporary file and sends them to the replica while the transaction is still doing DML. The key steps:

During DML execution, once the binlog events a transaction has produced exceed a certain amount, the transaction is registered in the large-transaction list and handled as a large transaction.
Based on that list, the binlog Dump thread reads the large transaction’s binlog temporary file and sends its contents to the replica. The large transaction’s binlog events and the events from the binlog file are sent interleaved, with flow control on the large transaction: events from the binlog file take priority, so the transaction currently committing is unaffected.
The large transaction’s binlog events carry a special marker and extra information. When the replica’s IO thread receives them, it stores them in a temporary file called the Relay Log Cache.
At commit, once the Dump thread has sent all the binlog events, it sends a Gtid_event to the replica.
On receiving the Gtid_event, the replica knows it has all of the transaction’s binlog events, and it turns the Relay Log Cache into a Relay Log file.
When several large transactions run at once, the mechanism can transmit them all in real time simultaneously.

From these steps we can see: a large transaction’s binlog events are sent to the replica bit by bit as they are produced, so at commit only the Gtid_event needs to be sent. The amount of data sent at commit is therefore tiny, and it no longer blocks other transactions’ binlog-event transmission. It also removes the sudden burst of network traffic, reducing congestion.
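
The interleaving with priority for binlog-file events can be pictured with a small Python scheduler. The policy shown is an assumption made for illustration: the article states only that binlog-file events take priority and that the large transaction is flow-controlled, so the per-round budget and the function name are hypothetical.

```python
from collections import deque


def schedule_sends(ordinary_arrivals, large_trx_events, brr_budget_per_round=2):
    """Return the send order for one Dump thread.

    ordinary_arrivals: list of rounds, each a list of events that appeared in the
                       binlog file during that round (committing transactions)
    large_trx_events:  events read from the large transaction's temporary file
    """
    pending_large = deque(large_trx_events)
    order = []
    for arrivals in ordinary_arrivals:
        # Binlog-file events always go first, so committing transactions never
        # queue behind the large transaction.
        order.extend(("binlog", event) for event in arrivals)
        # Flow control: at most a fixed number of large-transaction events per round.
        for _ in range(min(brr_budget_per_round, len(pending_large))):
            order.append(("brr", pending_large.popleft()))
    # Once no more ordinary traffic arrives, drain whatever is left.
    order.extend(("brr", event) for event in pending_large)
    return order
```

With rounds `[["t1"], [], ["t2", "t3"]]` and five large-transaction events, the large transaction fills the idle second round, and `t2` and `t3` are still sent ahead of the fifth event when they arrive.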

Relay Log Cache

The realtime-transmission mechanism follows directly from the large-transaction commit optimization and reuses parts of its implementation. A transaction’s binlog events are produced and accumulate during DML execution; once they exceed binlog_cache_size, they are written to a temporary file, and at commit they are written to the binlog file all at once. In MySQL Large Transaction Commit Optimization, a large transaction’s temporary file is automatically turned into a new binlog file, which eliminates the problems that large-transaction commit causes.

Realtime large-transaction transmission reuses this logic, reserving some space at the head of the Relay Log Cache. When the Relay Log Cache is turned into a Relay Log file, that head space is filled with the special binlog events a relay log needs, such as the Format_description_event.

Handling Failures

A large transaction runs for a long time, so any failure along the way has to be handled.

If the large transaction rolls back on the primary, the binlog Dump thread sends a rollback to the replica; on receiving it, the IO thread destroys the corresponding Relay Log Cache.
If the IO thread’s connection to the primary drops, or a STOP SLAVE is issued, the IO thread destroys all Relay Log Caches. After reconnecting, it restarts realtime replication of the large transaction.

Results

We used sysbench oltp_write_only to simulate a normal write workload, then committed, in the background, a transaction that generated 2 GB of binlog events. The results:

With realtime replication, the application’s writes run smoothly, with no more drops to zero.

Conclusion

In MySQL’s semi-synchronous replication architecture, large transactions are a classic problem. To keep them from destabilizing the instance, people have had to work hard to eliminate large transactions from their applications, or simply let replication degrade to async. Realtime large-transaction transmission moves the transmission of a large transaction’s binlog events from the commit phase up to the execution phase, sending each event to the replica as soon as it is produced. This avoids blocking other transactions’ binlog-event transmission for a long time at commit, and avoids network congestion. When a large transaction comes along, semisync no longer needs to degrade to async — clearing away a thorny obstacle on the path to a semisync-based RPO = 0 design.

Inside MySQL 9.7 LTS Features

MySQL Performance Blog — Wed, 15 Jul 2026 05:00:23 +0000

MySQL 9.7, a Long-Term Support (LTS) release, incorporates a variety of promising features spanning multiple technical domains. This article covers some of the primary features introduced and evaluates their practical utility within the MySQL database environment.

Following the End-of-Life (EOL) status of MySQL 8.0, this subsequent LTS release is designed to provide enhanced stability alongside significant architectural innovations.

Let’s discuss each of these features below with some examples and usage.

Flow-control monitoring in Group Replication

Flow control monitoring has been improved and provides more granularity by introducing the additional status variables listed below.

Gr_flow_control_throttle_count: It denotes the number of transactions that have been throttled.
Gr_flow_control_throttle_time_sum: It denotes the time in microseconds that transactions have been throttled.
Gr_flow_control_throttle_active_count: It denotes the number of transactions currently being throttled.
Gr_flow_control_throttle_last_throttle_timestamp: It denotes the most recent date and time that a transaction was throttled.

To use these status variables, we must install the “Group Replication Flow Control Statistics” component.

mysql> INSTALL COMPONENT 'file://component_group_replication_flow_control_stats';

After the component is installed, the statistics will be visible.

mysql> SELECT * FROM performance_schema.global_status WHERE VARIABLE_NAME LIKE 'Gr_flow_control%';
+--------------------------------------------------+----------------+
| VARIABLE_NAME                                    | VARIABLE_VALUE |
+--------------------------------------------------+----------------+
| Gr_flow_control_throttle_active_count            | 0              |
| Gr_flow_control_throttle_count                   | 0              |
| Gr_flow_control_throttle_last_throttle_timestamp |                |
| Gr_flow_control_throttle_time_sum                | 0              |
+--------------------------------------------------+----------------+

Multi-threaded applier extended statistics

We now have additional verbosity for the Applier threads for both Asynchronous and Group Replication topologies. This means we can get more details of the transactions or potential misbehaviours during the transactions applier stage. This feature is particularly useful for troubleshooting performance bottlenecks in multi-threaded replication environments, where understanding the specific cause of lag can be challenging.

This requires installing the “Replication Applier Metrics” component.

mysql> INSTALL COMPONENT 'file://component_replication_applier_metrics';

Upon successful installation of the requisite component, the performance schema tables facilitate tracking of transaction details and various performance metrics during the replication applier phase. For instance, monitoring the table “replication_applier_metrics” enables observing channel-specific operations.

mysql> SELECT * FROM performance_schema.replication_applier_metrics WHERE CHANNEL_NAME='group_replication_applier'\G
*************************** 1. row ***************************
                                CHANNEL_NAME: group_replication_applier
                  TOTAL_ACTIVE_TIME_DURATION: 0
                          LAST_APPLIER_START: 0000-00-00 00:00:00
                TRANSACTIONS_COMMITTED_COUNT: 0
                  TRANSACTIONS_ONGOING_COUNT: 0
                  TRANSACTIONS_PENDING_COUNT: 0
       TRANSACTIONS_COMMITTED_SIZE_BYTES_SUM: 0
    TRANSACTIONS_ONGOING_FULL_SIZE_BYTES_SUM: 0
TRANSACTIONS_ONGOING_PROGRESS_SIZE_BYTES_SUM: 0
         TRANSACTIONS_PENDING_SIZE_BYTES_SUM: NULL
                      EVENTS_COMMITTED_COUNT: 0
            WAITS_FOR_WORK_FROM_SOURCE_COUNT: 0
         WAITS_FOR_WORK_FROM_SOURCE_SUM_TIME: 0
            WAITS_FOR_AVAILABLE_WORKER_COUNT: 0
         WAITS_FOR_AVAILABLE_WORKER_SUM_TIME: 0
      WAITS_COMMIT_SCHEDULE_DEPENDENCY_COUNT: 0
   WAITS_COMMIT_SCHEDULE_DEPENDENCY_SUM_TIME: 0
         WAITS_FOR_WORKER_QUEUE_MEMORY_COUNT: 0
      WAITS_FOR_WORKER_QUEUE_MEMORY_SUM_TIME: 0
              WAITS_WORKER_QUEUES_FULL_COUNT: 0
           WAITS_WORKER_QUEUES_FULL_SUM_TIME: 0
             WAITS_DUE_TO_COMMIT_ORDER_COUNT: 0
          WAITS_DUE_TO_COMMIT_ORDER_SUM_TIME: 0
        TIME_TO_READ_FROM_RELAY_LOG_SUM_TIME: 0

In addition to aggregate metrics, MySQL 9.7 provides a way to inspect the progress of individual worker threads via monitoring stats in the “replication_applier_progress_by_worker” table. This level of detail helps administrators identify if a single transaction is monopolising a specific worker, causing overall replication delay.

mysql> SELECT * FROM performance_schema.replication_applier_progress_by_worker\G
*************************** 1. row ***************************
                          CHANNEL_NAME: group_replication_applier
                             WORKER_ID: 0
                             THREAD_ID: 62
              ONGOING_TRANSACTION_TYPE: UNASSIGNED
   ONGOING_TRANSACTION_FULL_SIZE_BYTES: 0
ONGOING_TRANSACTION_APPLIED_SIZE_BYTES: 0
*************************** 2. row ***************************
                          CHANNEL_NAME: group_replication_applier
                             WORKER_ID: 1
                             THREAD_ID: 63
              ONGOING_TRANSACTION_TYPE: UNASSIGNED
   ONGOING_TRANSACTION_FULL_SIZE_BYTES: 0
ONGOING_TRANSACTION_APPLIED_SIZE_BYTES: 0
*************************** 3. row ***************************
                          CHANNEL_NAME: group_replication_applier
                             WORKER_ID: 2
                             THREAD_ID: 64
              ONGOING_TRANSACTION_TYPE: UNASSIGNED
   ONGOING_TRANSACTION_FULL_SIZE_BYTES: 0
ONGOING_TRANSACTION_APPLIED_SIZE_BYTES: 0
*************************** 4. row ***************************
                          CHANNEL_NAME: group_replication_applier
                             WORKER_ID: 3
                             THREAD_ID: 65
              ONGOING_TRANSACTION_TYPE: UNASSIGNED
   ONGOING_TRANSACTION_FULL_SIZE_BYTES: 0
ONGOING_TRANSACTION_APPLIED_SIZE_BYTES: 0
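
As a rough illustration of how these per-worker rows could be used, the sketch below turns the two size columns into a completion percentage. It assumes the rows have already been fetched into Python dictionaries keyed by column name; the sample values are hypothetical, and reading APPLIED/FULL as a progress ratio is an interpretation of the column names, not documented behaviour.

```python
def worker_progress(rows):
    """Return {worker_id: percent applied} for workers with an ongoing transaction."""
    progress = {}
    for row in rows:
        full = row["ONGOING_TRANSACTION_FULL_SIZE_BYTES"]
        applied = row["ONGOING_TRANSACTION_APPLIED_SIZE_BYTES"]
        if full > 0:  # idle workers report 0 bytes; skip them to avoid dividing by zero
            progress[row["WORKER_ID"]] = round(100.0 * applied / full, 1)
    return progress


# Hypothetical sample: worker 0 is idle, worker 1 is a quarter of the way through.
rows = [
    {"WORKER_ID": 0,
     "ONGOING_TRANSACTION_FULL_SIZE_BYTES": 0,
     "ONGOING_TRANSACTION_APPLIED_SIZE_BYTES": 0},
    {"WORKER_ID": 1,
     "ONGOING_TRANSACTION_FULL_SIZE_BYTES": 2_000_000,
     "ONGOING_TRANSACTION_APPLIED_SIZE_BYTES": 500_000},
]
print(worker_progress(rows))  # {1: 25.0}
```

A worker that stays at a low percentage across repeated samples while the others are idle would be a candidate for the "single large transaction" case described above.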

Automatic eviction & rejoin

The Group Replication resource manager now provides auto-eviction, which we can configure using the options below. It ensures that an unhealthy node is removed from the group to preserve the cluster’s high availability and overall performance.

This requires installing the “group replication resource manager” component.

mysql> INSTALL COMPONENT 'file://component_group_replication_resource_manager';

Once the component is available, we can use various options to decide the node expulsion policy.

1) Applier channel

We can set the applier channel replication lag threshold values using the configuration parameter below.

mysql> set global group_replication_resource_manager.applier_channel_lag = ;

If the lag exceeds the “applier_channel_lag” threshold 10 or more times in a row, the server is expelled from the group. The status variable below is used to track this lag.

mysql> show global status like 'Gr_resource_manager_applier_channel_lag';
+-----------------------------------------+-------+
| Variable_name                           | Value |
+-----------------------------------------+-------+
| Gr_resource_manager_applier_channel_lag | 0     |
+-----------------------------------------+-------+

2) Recovery Channel

Similarly, we can define a lag threshold for the recovery channel, which a group member uses while it attempts to rejoin the cluster.

mysql> set global group_replication_resource_manager.recovery_channel_lag = ;

If the secondary’s recovery lag exceeds “recovery_channel_lag” 10 or more times in succession, the server is expelled from the group.

mysql> show global status like 'Gr_resource_manager_recovery_channel_lag';
+------------------------------------------+-------+
| Variable_name                            | Value |
+------------------------------------------+-------+
| Gr_resource_manager_recovery_channel_lag | 0     |
+------------------------------------------+-------+

3) Memory/Resource Usage

We can also define an expulsion condition based on the percentage of memory or resources the group member is using.

mysql> set global group_replication_resource_manager.memory_used_limit = 10;

If memory usage exceeds the memory_used_limit percentage 10 or more consecutive times, the node is expelled from the group.

mysql> show global status like 'Gr_resource_manager_memory_used%';
+---------------------------------+-------+
| Variable_name                   | Value |
+---------------------------------+-------+
| Gr_resource_manager_memory_used | 78    |
+---------------------------------+-------+
1 row in set (0.002 sec)
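
All three policies above share the same rule: a member is expelled only after ten or more consecutive breaches. The sketch below is a hypothetical model of that rule, not the component’s code. The class name, the sampling loop, and the assumption that one healthy sample resets the streak (implied by “in a row”) are all illustrative.

```python
class ConsecutiveBreachTracker:
    """Hypothetical model: expel after `max_consecutive` breaches in a row."""

    def __init__(self, limit, max_consecutive=10):
        self.limit = limit
        self.max_consecutive = max_consecutive
        self.hits = 0

    def observe(self, value):
        """Record one sample; return True once the member should be expelled."""
        if value > self.limit:
            self.hits += 1
        else:
            self.hits = 0  # assumed: a healthy sample breaks the streak
        return self.hits >= self.max_consecutive


tracker = ConsecutiveBreachTracker(limit=50)
# Nine breaches, one healthy sample, then ten breaches in a row.
samples = [60] * 9 + [40] + [60] * 10
decisions = [tracker.observe(s) for s in samples]
print(decisions.index(True))  # 19: only the uninterrupted run of ten triggers expulsion
```

Under this model a member that breaches the limit often, but never ten times back to back, is never expelled.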

In addition to the options discussed above, we can also track various server status variables to monitor group replication and the resource manager component.

mysql> select * from performance_schema.global_status where variable_name in ('Gr_resource_manager_applier_channel_threshold_hits','Gr_resource_manager_applier_channel_eviction_timestamp','Gr_resource_manager_recovery_channel_threshold_hits','Gr_resource_manager_recovery_channel_eviction_timestamp','Gr_resource_manager_memory_threshold_hits','Gr_resource_manager_memory_eviction_timestamp');
+---------------------------------------------------------+----------------+
| VARIABLE_NAME                                           | VARIABLE_VALUE |
+---------------------------------------------------------+----------------+
| Gr_resource_manager_applier_channel_eviction_timestamp  |                |
| Gr_resource_manager_applier_channel_threshold_hits      | 0              |
| Gr_resource_manager_memory_eviction_timestamp           |                |
| Gr_resource_manager_memory_threshold_hits               | 6703           |
| Gr_resource_manager_recovery_channel_eviction_timestamp |                |
| Gr_resource_manager_recovery_channel_threshold_hits     | 0              |
+---------------------------------------------------------+----------------+
6 rows in set (0.003 sec)

The expelled node can attempt to automatically rejoin based on the value of the group_replication_autorejoin_tries variable.

mysql> show variables like '%group_replication_autorejoin_tries%';
+------------------------------------+-------+
| Variable_name                      | Value |
+------------------------------------+-------+
| group_replication_autorejoin_tries | 3     |
+------------------------------------+-------+
1 row in set (0.006 sec)

If the node cannot join, it will perform the behaviour specified in the group_replication_exit_state_action variable.

mysql> show variables like '%group_replication_exit_state_action%';
+-------------------------------------+--------------+
| Variable_name                       | Value        |
+-------------------------------------+--------------+
| group_replication_exit_state_action | OFFLINE_MODE |
+-------------------------------------+--------------+
1 row in set (0.005 sec)

After a server is evicted from the group (for whatever reason), it gets a grace period (group_replication_resource_manager.quarantine_time) when it rejoins. During this period, the Resource Manager won’t immediately expel it again, even if it is still lagging or breaching the thresholds discussed above.

mysql> show variables like '%group_replication_resource_manager.quarantine_time%';
+----------------------------------------------------+-------+
| Variable_name                                      | Value |
+----------------------------------------------------+-------+
| group_replication_resource_manager.quarantine_time | 3600  |
+----------------------------------------------------+-------+
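
The grace period can be pictured as a simple time-window check. This is a minimal sketch under stated assumptions: the function name and arguments are hypothetical, the value 3600 is assumed to be in seconds, and the exact boundary behaviour of the real component is not known from this post.

```python
def may_evict(now, rejoined_at, quarantine_time, breached):
    """Return True if an eviction is allowed at time `now` (all times in seconds).

    rejoined_at is None for a member that has never been evicted and rejoined.
    """
    in_quarantine = rejoined_at is not None and (now - rejoined_at) < quarantine_time
    return breached and not in_quarantine


QUARANTINE_TIME = 3600  # value shown above; unit assumed to be seconds

# Thirty minutes after rejoining: still protected, even though a threshold is breached.
print(may_evict(now=1800, rejoined_at=0, quarantine_time=QUARANTINE_TIME, breached=True))
# Two hours after rejoining: the normal eviction rules apply again.
print(may_evict(now=7200, rejoined_at=0, quarantine_time=QUARANTINE_TIME, breached=True))
```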

Up-to-date aware Primary election

The Primary election process is more mature and cohesive. The Group Replication Manager now uses the most up-to-date status as a criterion for selecting the new primary.

Here is how the Group Replication Manager selected the new primary prior to MySQL 9.7:

The lowest MySQL version is checked for each member.
If more than one member is running the lowest MySQL Server version, each member’s weight is determined by the “group_replication_member_weight” system variable.
If there is more than one member running the lowest MySQL Server version, and also more than one of those members has the highest member weight, the third factor considered is the lexicographical order of the generated server UUIDs “server_uuid” of each group member. The member with the lowest server UUID is chosen as the new primary.

In MySQL 9.7, “group_replication_elect_prefers_most_updated” was introduced, so failover now takes into account how many transactions each secondary still has in its backlog. The secondary with the smallest backlog is selected as the primary.

The election now considers the “most up-to-date” node first, then “weight”, and then “UUID”.
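
The ordering described above can be expressed as a three-part sort key. The sketch below is a hypothetical model, not the server’s implementation: the Member class and its backlog field are made up for illustration, the UUIDs are sample values, and the candidates are assumed to have already passed the version check.

```python
from dataclasses import dataclass


@dataclass
class Member:
    uuid: str
    weight: int   # group_replication_member_weight
    backlog: int  # hypothetical: transactions received but not yet applied


def elect_primary(candidates):
    """Smallest backlog first, then highest weight, then lowest UUID (lexical)."""
    return min(candidates, key=lambda m: (m.backlog, -m.weight, m.uuid))


members = [
    Member("00021703-3333-3333-3333-333333333333", weight=80, backlog=2755),
    Member("00021704-4444-4444-4444-444444444444", weight=50, backlog=0),
    Member("00021702-2222-2222-2222-222222222222", weight=50, backlog=0),
]
print(elect_primary(members).uuid)  # 00021702-2222-2222-2222-222222222222
```

Note that the member with the highest weight loses here because it is behind; weight and UUID only break ties among the most up-to-date members.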

To use “group_replication_elect_prefers_most_updated”, we need to install the “Group Replication Primary Election” component listed below on each Group Member.

mysql> Install component 'file://component_group_replication_elect_prefers_most_updated';

By default, the most up-to-date group member selection is enabled. We need to make sure it’s enabled on all Group Members.

mysql> select @@group_replication_elect_prefers_most_updated.enabled;
+--------------------------------------------------------+
| @@group_replication_elect_prefers_most_updated.enabled |
+--------------------------------------------------------+
|                                                      1 |
+--------------------------------------------------------+
1 row in set (0.007 sec)

When a new primary is elected via the most up-to-date selection mechanism, the status variable below reports the difference in processed transactions between the newly elected primary and the second most up-to-date member.

mysql> show status like 'Gr_latest_primary_election_by_most_uptodate_members_trx_delta';
+---------------------------------------------------------------+-------+
| Variable_name                                                 | Value |
+---------------------------------------------------------------+-------+
| Gr_latest_primary_election_by_most_uptodate_members_trx_delta | 0     |
+---------------------------------------------------------------+-------+

We can also track the timestamp of the most recent primary election that was decided by the most up-to-date criterion.

mysql> show status like 'Gr_latest_primary_election_by_most_uptodate_member_timestamp';
+--------------------------------------------------------------+-------+
| Variable_name                                                | Value |
+--------------------------------------------------------------+-------+
| Gr_latest_primary_election_by_most_uptodate_member_timestamp |       |
+--------------------------------------------------------------+-------+
1 row in set (0.005 sec)

The database logs also state exactly which criterion was used to select the primary member during failover.

2026-06-14T10:04:02.243809Z 0 [System] [MY-015575] [Repl] Plugin group_replication reported: 'Member with uuid 00021702-2222-2222-2222-222222222222 was elected primary since it was the most up-to-date member with 2755 transactions more than second most up-to-date member 00021703-3333-3333-3333-333333333333. In case of a tie member weight and then uuid lexical order was used over the most updated members.'

MySQL JSON duality views

With the introduction of JSON duality views, we can leverage a single unified JSON document for both relational and hierarchical JSON data. This provides a common, structured JSON format for the application, allowing it to perform both read and write operations.

Let’s see a quick scenario below on how it works.

Below are two relational tables from which we obtain aggregated information in JSON format.

mysql> CREATE TABLE products (
  product_id INT PRIMARY KEY,
  product_type VARCHAR(100)
);

mysql> CREATE TABLE products_details (
  product_detail_id INT PRIMARY KEY,
  product_id INT,
  name VARCHAR(100),
  active varchar(10)
);

mysql> INSERT INTO products (product_id,product_type) VALUES (1,'IT'), (2,'TEL');
mysql> INSERT INTO products_details (product_detail_id,product_id,name,active) VALUES (1,1,'Laptop','Yes'), (2,2,'Mobile','Yes');

Here is the JSON duality view, which fetches the columns from the relational tables based on the join condition. Each column of the products table is mapped to a JSON key (_id, v_product_type), and the matching rows of the products_details table are fetched into the “product” array with the keys v_product_detail_id, v_name, and v_active.

mysql> CREATE JSON RELATIONAL DUALITY VIEW view_product AS
SELECT JSON_DUALITY_OBJECT( WITH(INSERT,UPDATE,DELETE)
    '_id': product_id,
    'v_product_type': product_type,
    'product': (
        SELECT JSON_ARRAYAGG(
            JSON_DUALITY_OBJECT(WITH(INSERT,UPDATE,DELETE)
                'v_product_detail_id': product_detail_id,
                'v_name': name,
                'v_active': active
                
            )
        )
        FROM products_details
        WHERE products_details.product_id = products.product_id
    )
)
FROM products;

mysql> select * from view_product;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| data                                                                                                                                                                           |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"_id": 1, "product": [{"v_name": "Laptop", "v_active": "Yes", "v_product_detail_id": 1}], "_metadata": {"etag": "313642c2aa24f0571264332afa140715"}, "v_product_type": "IT"}  |
| {"_id": 2, "product": [{"v_name": "Mobile", "v_active": "Yes", "v_product_detail_id": 2}], "_metadata": {"etag": "3d229ada02ac660f9f6cac994b44831a"}, "v_product_type": "TEL"} |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.002 sec)
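
To make the mapping concrete, here is a plain-Python model of how each document is assembled from the two tables. It only mirrors the shape of the output above; it is not how MySQL implements duality views, and it leaves out the server-generated "_metadata" / "etag" field.

```python
import json

# The same rows that were inserted into the two tables above.
products = [
    {"product_id": 1, "product_type": "IT"},
    {"product_id": 2, "product_type": "TEL"},
]
products_details = [
    {"product_detail_id": 1, "product_id": 1, "name": "Laptop", "active": "Yes"},
    {"product_detail_id": 2, "product_id": 2, "name": "Mobile", "active": "Yes"},
]


def build_documents(products, details):
    """One document per products row, with matching detail rows nested as an array."""
    docs = []
    for p in products:
        children = [
            {
                "v_product_detail_id": d["product_detail_id"],
                "v_name": d["name"],
                "v_active": d["active"],
            }
            for d in details
            if d["product_id"] == p["product_id"]  # the view's correlated WHERE clause
        ]
        docs.append(
            {"_id": p["product_id"], "v_product_type": p["product_type"], "product": children}
        )
    return docs


docs = build_documents(products, products_details)
for doc in docs:
    print(json.dumps(doc))
```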

Once the duality view is created, we can perform both read/write operations.

Reading the duality view

mysql> select * from view_product;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| data                                                                                                                                                                           |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"_id": 1, "product": [{"v_name": "Laptop", "v_active": "Yes", "v_product_detail_id": 1}], "_metadata": {"etag": "313642c2aa24f0571264332afa140715"}, "v_product_type": "IT"}  |
| {"_id": 2, "product": [{"v_name": "Mobile", "v_active": "Yes", "v_product_detail_id": 2}], "_metadata": {"etag": "3d229ada02ac660f9f6cac994b44831a"}, "v_product_type": "TEL"} |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Writing to the underlying table through the duality view

mysql> UPDATE view_product
SET data = JSON_SET(
    data,
    '$.product[0].v_name',
    'Notepad'
)
WHERE JSON_EXTRACT(data, '$._id') = 1;

mysql> select * from products_details;
+-------------------+------------+---------+--------+
| product_detail_id | product_id | name    | active |
+-------------------+------------+---------+--------+
|                 1 |          1 | Notepad | Yes    |
|                 2 |          2 | Mobile  | Yes    |
+-------------------+------------+---------+--------+

After performing the above write operations, we can see that the view now shows the updated data.

mysql> select * from view_product;
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| data                                                                                                                                                                           |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"_id": 1, "product": [{"v_name": "Notepad", "v_active": "Yes", "v_product_detail_id": 1}], "_metadata": {"etag": "72c4368420cdc698842d0ab4bd9315ab"}, "v_product_type": "IT"} |
| {"_id": 2, "product": [{"v_name": "Mobile", "v_active": "Yes", "v_product_detail_id": 2}], "_metadata": {"etag": "3d229ada02ac660f9f6cac994b44831a"}, "v_product_type": "TEL"} |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

Hypergraph Optimizer

With the Hypergraph Optimiser, we now get more advanced optimisation for complex queries and a broader set of join plans than the traditional optimiser could consider. By modelling the query as a “join hypergraph”, the optimiser has better reach across all tables in the join conditions.

Hypergraph Optimiser is OFF

mysql> SET optimizer_switch='hypergraph_optimizer=off';

mysql> SELECT t1.k, COUNT(*) AS cnt
FROM sbtest1 t1
JOIN sbtest2 t2 ON t1.id = t2.id
JOIN sbtest3 t3 ON t1.id = t3.id
WHERE t1.k BETWEEN 200000 AND 500000
GROUP BY t1.k
ORDER BY cnt DESC
LIMIT 100;

Output:

| 498870 | 119 |
| 498729 | 119 |
| 497668 | 119 |
| 498076 | 119 |
+--------+-----+
100 rows in set (4.000 sec)

Explain output:

-> Limit: 100 row(s)
    -> Sort: cnt DESC, limit input to 100 row(s) per chunk
        -> Stream results  (cost=1.22e+6 rows=175136)
            -> Group aggregate: count(0)  (cost=1.22e+6 rows=175136)
                -> Nested loop inner join  (cost=1.1e+6 rows=493200)
                    -> Nested loop inner join  (cost=601547 rows=493200)
                        -> Filter: (t1.k between 200000 and 500000)  (cost=99122 rows=493200)
                            -> Covering index range scan on t1 using k_1 over (200000 <= k <= 500000)  (cost=99122 rows=493200)
                        -> Single-row covering index lookup on t2 using PRIMARY (id = t1.id)  (cost=0.919 rows=1)
                    -> Single-row covering index lookup on t3 using PRIMARY (id = t1.id)  (cost=0.919 rows=1)

Hypergraph Optimiser is ON

mysql> SET optimizer_switch='hypergraph_optimizer=on';

mysql> SELECT t1.k, COUNT(*) AS cnt
FROM sbtest1 t1
JOIN sbtest2 t2 ON t1.id = t2.id
JOIN sbtest3 t3 ON t1.id = t3.id
WHERE t1.k BETWEEN 200000 AND 500000
GROUP BY t1.k
ORDER BY cnt DESC
LIMIT 100;

Output:

| 499721 | 119 |
| 499052 | 119 |
| 498870 | 119 |
| 498384 | 119 |
+--------+-----+
100 rows in set (0.498 sec)

Explain output:

-> Sort: cnt DESC, limit input to 100 row(s) per chunk  (cost=1.96e+6..1.96e+6 rows=100)
    -> Table scan on <temporary>  (cost=1.87e+6..1.9e+6 rows=175136)
        -> Aggregate using temporary table  (cost=1.87e+6..1.87e+6 rows=175136)
            -> Inner hash join (t2.id = t3.id)  (cost=990754..1.44e+6 rows=493200)
                -> Covering index scan on t3 using k_1  (cost=0.312..308240 rows=986400)
                -> Hash
                    -> Inner hash join (t1.id = t2.id)  (cost=370988..824021 rows=493200)
                        -> Covering index scan on t2 using k_1  (cost=0.312..308240 rows=986400)
                        -> Hash
                            -> Filter: (t1.k between 200000 and 500000)  (cost=0.416..205287 rows=493200)
                                -> Covering index range scan on t1 using k_1 over (200000 <= k <= 500000)  (cost=0.359..176877 rows=493200)

We can see that with “hypergraph_optimizer=on”, the query runs about 8x faster (0.498 sec versus 4.000 sec).

The performance difference might not be noticeable with only a few joins or with small tables, but with more complex joins it can yield better performance. In the above example, we can see that with “hypergraph_optimizer=on”, the optimiser replaces “Nested loop inner join” with “Inner hash join”, which is generally better for large datasets.
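
For readers unfamiliar with the two join strategies, the sketch below shows a textbook hash join (a build phase followed by a probe phase) next to a naive nested loop. It is a simplified illustration with made-up rows, not MySQL’s executor; in particular, the nested-loop plan shown earlier used primary-key lookups for the inner tables rather than a full inner scan.

```python
def nested_loop_join(left, right, key):
    """Naive nested loop: compare every left row with every right row."""
    return [(l, r) for l in left for r in right if l[key] == r[key]]


def hash_join(build, probe, key):
    """Textbook hash join: hash the build input once, then probe it row by row."""
    table = {}
    for row in build:  # build phase
        table.setdefault(row[key], []).append(row)
    out = []
    for row in probe:  # probe phase: one hash lookup per probe row
        for match in table.get(row[key], []):
            out.append((match, row))
    return out


# Hypothetical rows: ids 0..4 on one side, 3..7 on the other.
t1 = [{"id": i, "k": i * 10} for i in range(5)]
t2 = [{"id": i} for i in range(3, 8)]

nl = sorted((l["id"], r["id"]) for l, r in nested_loop_join(t1, t2, "id"))
hj = sorted((l["id"], r["id"]) for l, r in hash_join(t1, t2, "id"))
print(nl == hj, hj)  # True [(3, 3), (4, 4)]
```

Both functions return the same pairs; the difference is that the hash join reads each input once, which is why it tends to win when both sides are large.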

Higher version source allowed

A lower-version replica can now connect to a higher-version source, even when the major versions differ. That means we don’t have to upgrade all replicas in one go; we can upgrade the source, verify it, and later perform rolling upgrades on the lower-version replicas according to our own timelines and convenience.

Of course, we have to be careful not to use any feature or change on the source that lower-version replicas don’t support.

Please note: this doesn’t apply to previous releases such as 8.4 and 8.0, as they didn’t restrict such replication connectivity. It is useful for 9.7 and the next major release.

To use this functionality, we need to ensure the following variable is enabled on the replica. By default, it’s enabled in 9.7.

mysql> show variables like 'replica_allow_higher_version_source';
+-------------------------------------+-------+
| Variable_name                       | Value |
+-------------------------------------+-------+
| replica_allow_higher_version_source | ON    |
+-------------------------------------+-------+
1 row in set (0.008 sec)

Summary

The above discussion highlights key advancements in MySQL 9.7 LTS, ranging from innovative and operational improvements to developer-centric features such as JSON duality views. Also, the Hypergraph Optimiser, which was previously exclusive to MySQL HeatWave/Enterprise, is now available in the community release. As a Long-Term Support (LTS) release, MySQL 9.7 is structured to provide a stable and consistent environment, prioritising architectural reliability over frequent experimental changes.

One more important mention: it’s suggested to use MySQL 9.7.1 or a later sub-release, as 9.7.0 has some higher-severity CVEs. If you are using Percona Server for MySQL (PS), we skipped 9.7.0 and are shipping the fixed 9.7.1 version directly.

Still, it’s highly recommended to test any new component or changes in your lower/staging environment before deploying in production to better assess the overall impact on existing workload, queries, and database behaviour.

The post Inside MySQL 9.7 LTS Features appeared first on Percona.

Commit Optimization for Large MySQL Transactions

Libing Song — Wed, 15 Jul 2026 02:00:00 +0000

This article is also available in Chinese: 中文版.

If you use and operate MySQL, you’ve surely run into a strange slow query like this:

An INSERT that’s normally instant took 1.3s, and the slow-query log shows no long lock wait.
Every statement in a multi-statement transaction had already finished, yet the COMMIT alone took 1.3s.

When this happens, the most likely cause is a large transaction committing. Below is a simulated test: we used sysbench to simulate a normal workload, then ran a large UPDATE in the background every 5 seconds. You can see the large UPDATE severely hurts performance.

Root Cause

The figure above shows the execution of two transactions:

A transaction runs in two phases: an execution phase and a commit phase.
During execution, when a statement updates data it generates binlog events. These are stored in the Binlog Cache, which has two parts: an in-memory buffer and a temporary file. When the buffer fills up, the events are written to the temporary file.
At commit, all the binlog events in the Binlog Cache are copied into the binlog file.
Writing binlog events to the binlog file must be serialized — one transaction can’t do it until the previous one has finished. So while Trx_n is writing to the binlog file, Trx_m has to wait.
In the figure, Trx_n is a large transaction that produced a lot of binlog events. The time to copy binlog events into the binlog file is linear in the size of the events the transaction produced — the more events, the longer the copy takes.
Trx_m is a small transaction. Even though its execution phase finished quickly, at commit it runs into the large transaction Trx_n committing, so it must wait for Trx_n to finish copying its binlog events before it can proceed. Trx_m spends most of its commit phase waiting for Trx_n to write the binlog file — and that’s why the small transaction becomes slow.
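
The effect of this serialization can be approximated with a toy model. The sketch below assumes every transaction is ready to commit at the same moment, that copying into the binlog file runs at a fixed bandwidth, and that nothing else competes for IO; the 400 MB/s figure and the function name are hypothetical.

```python
def commit_latencies(transactions, bandwidth_mb_per_s):
    """transactions: list of (name, binlog_size_mb) in commit-queue order.

    Returns {name: seconds until that transaction's binlog write has finished}.
    """
    clock = 0.0
    latencies = {}
    for name, size_mb in transactions:
        clock += size_mb / bandwidth_mb_per_s  # copy time is linear in binlog size
        latencies[name] = clock                # later transactions inherit the wait
    return latencies


# Trx_n produced 512 MB of binlog events; Trx_m produced about 10 KB.
result = commit_latencies([("Trx_n", 512), ("Trx_m", 0.01)], bandwidth_mb_per_s=400)
for name, seconds in result.items():
    print(f"{name}: {seconds:.3f} s")
```

In this model Trx_m needs only a few microseconds of its own IO, yet its commit takes roughly as long as Trx_n’s, because it cannot start writing until Trx_n has finished.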

How Serious the Problem Is

As our simulated test shows, committing a large transaction has a major impact on workload stability. In real-world scenarios it can be far worse, and it’s common.

A GB-scale transaction can make the instance unwritable for a long time. Since storage IO bandwidth is fixed, the time to write a large transaction’s binlog depends on the transaction’s size. The largest transaction we’ve seen in production produced 104 GB of binlog events.
A GB-scale transaction can drive IO throughput up and increase IO latency, or even saturate IO, which also slows queries.
A few-hundred-MB transaction won’t cause a long outage, but it can still add hundreds of milliseconds to application DML. For latency-sensitive workloads, even that may be unacceptable.
On top of this, all of the above can raise the number of active connections. If those active connections aren’t cleared in time, CPU spikes, and it can turn into a vicious cycle — eventually an avalanche and a much bigger problem.

Optimizing How Large Transactions Write the Binlog

In AliSQL we optimized how a large transaction writes the binlog, completely eliminating the stability impact of large-transaction commit. RDS 5.7 and RDS 8.0 both enable this optimization by default. Last year we contributed it to MariaDB, and the feature shipped in MariaDB 11.7¹.

The Approach

Here is the implementation in MariaDB 11.7. MySQL and MariaDB have diverged quite a bit in code, but the underlying logic — and therefore the approach — is the same.

The idea is simple and clean: since the Binlog Cache has already written the binlog events to a file, we just rename that file directly into a binlog file. This avoids copying the binlog events, so there is no extra IO. And a rename takes constant time regardless of the Binlog Cache’s size, which fully solves the large-transaction problem. Let’s look at the implementation.

The #binlog_cache_files Directory

The Binlog Cache’s file is a system temporary file, which can’t be renamed into a regular file directly. So we create a directory, #binlog_cache_files, in the binlog directory; the file the Binlog Cache creates then becomes a regular file in this directory instead of a system temp file.

$ ls var/mysqld.1/data/#binlog_cache_files
ML_140413554102520

Reserving Head Space

The Binlog Cache file contains only the transaction’s binlog events. To turn it into a binlog file, we need to reserve some space for the binlog header events, such as the Format_description_event.

The reserved space is 4 KB-aligned, so at least 4 KB is reserved, which is enough in most cases. But in some situations the Gtid_list_log_event (similar to MySQL’s Previous_gtids_event, recording the GTID set generated before this binlog) can be very large. To keep the feature usable in that case, when generating a new binlog file we adjust the reserved space based on how much the header events actually occupy; the Binlog Cache file’s reserved space is then adjusted when the next transaction begins. The binlog header events usually take less than 4 KB, so after writing them some space may be left over. How do we handle the leftover? Thanks to MariaDB’s mechanism of padding a Gtid_log_event with trailing zeros, the leftover space is absorbed into the corresponding Gtid_log_event. After the Binlog Cache file is turned into a binlog file, its structure looks like this:

The Rename Process

The rename works roughly as follows:

Persist the Binlog Cache file. At this point the rename hasn’t started, so it doesn’t block other transactions from committing.
Perform a rotate: close the current binlog file and create a new one.
Copy the new file’s header into the head of the Binlog Cache.
Generate the Gtid_log_event.
Delete the newly generated binlog file, and rename the Binlog Cache file into the new binlog file.
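
The core of these steps (reserved head space, header written in place, then a rename) can be sketched with ordinary file operations. This is a simplified illustration, not the MariaDB code: the file names and byte contents are made up, the rotate and Gtid_log_event steps are omitted, and leftover reserved space is simply left as zero bytes rather than absorbed into an event.

```python
import os
import tempfile

HEADER_ALIGN = 4096  # the 4 KB alignment described earlier


def reserved_size(header_len):
    """Smallest multiple of 4 KB that fits the header, and at least 4 KB."""
    blocks = max(1, -(-header_len // HEADER_ALIGN))  # ceiling division
    return blocks * HEADER_ALIGN


def commit_by_rename(cache_path, binlog_path, header):
    """Write the header into the reserved head space, then rename the cache file."""
    with open(cache_path, "r+b") as f:
        f.seek(0)
        f.write(header)       # must fit within the reserved head space
        f.flush()
        os.fsync(f.fileno())  # persist before the file becomes the binlog
    os.replace(cache_path, binlog_path)  # a rename: the event data is not copied


with tempfile.TemporaryDirectory() as d:
    cache = os.path.join(d, "cache_file")      # hypothetical cache file name
    binlog = os.path.join(d, "binlog.000002")  # hypothetical binlog file name
    events = b"EVENT" * 1000                   # stand-in for the transaction's events
    header = b"HDR" * 10                       # stand-in for the binlog header events
    reserve = reserved_size(len(header))

    with open(cache, "wb") as f:
        f.write(b"\0" * reserve)  # reserved head space
        f.write(events)           # events are written after the reserved space

    commit_by_rename(cache, binlog, header)

    with open(binlog, "rb") as f:
        data = f.read()
    header_ok = data[: len(header)] == header
    events_ok = data[reserve:] == events
    cache_gone = not os.path.exists(cache)
    print(header_ok, events_ok, cache_gone)  # True True True
```

The cost of commit_by_rename depends on the header size, not on how many events the cache file holds, which is the property the optimization relies on.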

Results

Again we used sysbench to simulate a workload, then ran a large UPDATE in the background every 5 seconds, each producing 512 MB of binlog events. The results:

With the large-transaction commit optimization, sysbench’s TPS is fairly steady, with no violent swings. There’s still a small dip every 5 seconds, but that comes from the large UPDATE itself using some CPU, not from transaction commit.

We also simulated the DML latency caused by transactions of different sizes. The results:

Without the optimization, once a large transaction exceeds 64 MB, sysbench’s max latency starts to climb noticeably, and rises rapidly as the transaction grows.
With the optimization on, sysbench’s max latency stays stable no matter how large the transaction, holding at normal workload levels. At 1024 GB, one extra binlog rotate adds a slight bump in latency.

Conclusion

In MySQL’s binlog replication architecture, large transactions are a classic trigger for trouble, causing stability and replication-lag problems. By renaming the Binlog Cache’s temporary file directly into a binlog file, we avoid copying binlog events and eliminate the extra IO, keeping large-transaction commit fast and stable — and fully resolving the various stability problems that large-transaction commit causes.

¹ MariaDB 11.7: Binlog Commit Optimization for Large Transactions

MyDumper Locking Mechanisms Revisited: Introducing SAFE_NO_LOCK

MySQL Performance Blog — Mon, 13 Jul 2026 12:24:41 +0000

About a year ago, we discussed how MyDumper refactored its locking mechanisms to move away from old, rigid flags and transitioned towards more flexible, streamlined execution. Since then, the MyDumper community hasn’t stood still.

In recent releases, the locking architecture was further standardized under a single overarching option: --sync-thread-lock-mode. Along with this modernization came a powerful new safety feature designed to give you lock-free thread synchronization without risking silent inconsistency: SAFE_NO_LOCK (merged in PR #2031).

Let’s explore the new thread-synchronization landscape and break down when you should use each mode.

What is `--sync-thread-lock-mode`?

Previously, flags like -k, --no-locks or --lock-all-tables dictated how MyDumper behaved. These have now been deprecated in favor of --sync-thread-lock-mode, which accepts five core values (AUTO, FTWRL, LOCK_ALL, GTID, NO_LOCK) plus the newly added SAFE_NO_LOCK.

As a multi-threaded tool, MyDumper’s main challenge is ensuring that every single worker thread establishes its database snapshot at the exact same point in time. The sync mode you choose completely alters how MyDumper orchestrates this point-in-time synchronization.

Understanding SAFE_NO_LOCK

MyDumper fires off START TRANSACTION WITH CONSISTENT SNAPSHOT across its threads. It captures the binary log position at the very beginning of the process and compares it after the worker threads have attempted to synchronize.

When using NO_LOCK, if the threads don’t actually hit the same point in time—meaning they fail to synchronize—MyDumper simply logs a warning and continues backing up. This results in an inconsistent backup, which is a massive gamble for production systems.

SAFE_NO_LOCK adds a strict transactional safety net. If MyDumper detects any differences or drift in the binlog position among the threads during the synchronization phase, it immediately stops the backup. This prevents you from generating a corrupted, out-of-sync backup that will fail or cause data anomalies during a later restore.
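
The difference between the two modes comes down to what happens when the check fails. The sketch below is a hypothetical Python model of that decision, not MyDumper’s code (MyDumper is written in C); the function, the exception, and the position tuples are illustrative names.

```python
import logging


class InconsistentSnapshotError(RuntimeError):
    """Raised in this model when SAFE_NO_LOCK detects binlog position drift."""


def check_sync(mode, position_before, position_after):
    """Compare the binlog position captured before and after thread synchronization.

    Returns True when the snapshot is consistent. On drift, NO_LOCK only warns
    and returns False, while SAFE_NO_LOCK raises to stop the backup.
    """
    if position_before == position_after:
        return True
    message = f"binlog position moved from {position_before} to {position_after}"
    if mode == "SAFE_NO_LOCK":
        raise InconsistentSnapshotError(message)
    logging.warning("%s; continuing with a possibly inconsistent backup", message)
    return False


before = ("binlog.000012", 4096)  # hypothetical (file, offset) pairs
after = ("binlog.000012", 5120)

print(check_sync("NO_LOCK", before, after))  # False, after logging a warning

try:
    check_sync("SAFE_NO_LOCK", before, after)
    aborted = False
except InconsistentSnapshotError:
    aborted = True
print(aborted)  # True
```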

Choosing the Right Mode

Depending on your architecture, uptime requirements, and database vendor, here is the breakdown of when to use each mode:

AUTO (The Default)

What it does: MyDumper automatically evaluates the database vendor, version, and capabilities to choose the safest, least-intrusive method.

When to use it: The vast majority of standard backups. It removes the guesswork and adapts dynamically if your database infrastructure upgrades.

FTWRL (Flush Tables With Read Lock)

What it does: It is the traditional method. It issues a global read lock via FLUSH TABLES WITH READ LOCK on the main connection, forces all threads to establish their consistent snapshot at that exact freeze frame, and then releases the lock.

When to use it:

When you have non-transactional tables (like MyISAM or ARCHIVE) that must be consistently backed up alongside InnoDB tables.
When your database lacks advanced snapshot-tracking capabilities (older MySQL versions).

Downside: It blocks writes across the entire instance during synchronization, which can cause a queue cascade on a busy production server.

GTID

What it does: Leverages a server variable specific to Percona Server, binlog_snapshot_gtid_executed, to verify instantly that all threads see the exact same transaction state.

When to use it: If you are running Percona Server with GTID enabled and want a lightning-fast, lockless synchronization method that is guaranteed to be transactionally accurate.

SAFE_NO_LOCK

What it does: Uses transaction isolation to sync threads without global locks, but immediately aborts the backup if binlog positions diverge during initialization.

When to use it:

On highly sensitive production systems, where global write locks are absolutely forbidden due to strict SLAs.
When you use only transactional engines (InnoDB).
When you want a lock-free backup but require absolute certainty that your backup is 100% consistent.

Downside: In high-throughput write environments, threads may fail to align within the retry window, causing the backup job to abort (though an abort is always preferable to an inconsistent backup).

NO_LOCK

What it does: Attempts lockless synchronization but logs a warning and proceeds even if consistency fails.

When to use it: Rarely, if ever, on a production primary server. It is acceptable for staging environments, development seeding, or scratch pads where data accuracy and point-in-time consistency are entirely secondary to getting a quick data dump without locking the server.

LOCK_ALL

What it does: Explicitly issues a LOCK TABLE command for every single table being exported.

When to use it: Primarily a fallback mode. Use this only when FLUSH TABLES WITH READ LOCK is completely unavailable due to restricted cloud permissions (certain restricted PaaS environments) or specific database limitations.
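To make the guidance above concrete, here is a small shell sketch that maps a few yes/no facts about an environment to a mode. The function name pick_mode, its arguments, and the order of the checks are assumptions made for illustration. MyDumper's AUTO mode does its own detection, and none of this is MyDumper code.

```shell
# Sketch only: encodes the "when to use it" notes above as a decision helper.
# pick_mode HAS_NON_TRANSACTIONAL CAN_FTWRL PERCONA_GTID LOCKS_FORBIDDEN
# Each argument is "yes" or "no". Prints a --sync-thread-lock-mode value.
pick_mode() {
    has_non_tx=$1
    can_ftwrl=$2
    percona_gtid=$3
    locks_forbidden=$4

    if [ "$has_non_tx" = yes ]; then
        # MyISAM/ARCHIVE tables need a lock to line up with InnoDB.
        if [ "$can_ftwrl" = yes ]; then
            echo FTWRL
        else
            echo LOCK_ALL        # fallback when FTWRL is not permitted
        fi
    elif [ "$percona_gtid" = yes ]; then
        echo GTID                # Percona Server with GTID enabled
    elif [ "$locks_forbidden" = yes ]; then
        echo SAFE_NO_LOCK        # lock-free, aborts on drift
    else
        echo AUTO                # let MyDumper decide
    fi
}
```

The result can then be passed along, for example mydumper --sync-thread-lock-mode="$(pick_mode no yes no yes)" together with your usual connection and output options.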

Conclusion

The addition of --sync-thread-lock-mode=SAFE_NO_LOCK bridges a long-standing gap in logical MySQL backups: achieving a completely lockless synchronization state without flying blind. By implementing a strict fail-fast policy, MyDumper ensures that database administrators never have to sacrifice backup integrity for system availability.

The post MyDumper Locking Mechanisms Revisited: Introducing SAFE_NO_LOCK appeared first on Percona.

Dynamic Data Masking (DDM) with MySQL Enterprise Edition 9.7: Reduce your sensitive data exposure.

Oracle MySQL Group — Mon, 13 Jul 2026 11:48:33 +0000

With the new LTS (Long Term Support) release of MySQL 9.7.0 https://dev.mysql.com/doc/relnotes/mysql/9.7/en/ , Dynamic Data Masking (DDM) is one of the new features introduced as part of Enterprise Edition. The recent blog by Mike Frank, MySQL Product Management Director, details why DDM is important in every industry where PII (Personally Identifiable Information) data is stored […]

InnoDB Flushing is simple – explained

Kedar Vaijanapurkar — Mon, 13 Jul 2026 05:15:00 +0000

As a junior, I once asked a seasoned MySQL DBA (Abuelo), “How do you stay so calm in critical situations?” Abuelo DBA then uttered golden words: “Son, I keep my dirty…

The post InnoDB Flushing is simple – explained first appeared on Change Is Inevitable.

Running DuckDB as a MySQL 9.7 storage engine

MySQL Performance Blog — Fri, 10 Jul 2026 13:38:16 +0000

ducksdb-mysql-engine is an experimental build of MySQL 9.7 where a table you mark ENGINE=DuckDB answers analytical queries from DuckDB instead of InnoDB. Same server, same connection, no second copy of the data. On TPC-H at scale factor 10, InnoDB times out on 6 of the 22 queries and burns 1317 seconds on the 16 it finishes. The DuckDB tables run all 22 in about 15 seconds.

It’s an experiment, not production software. It patches mysqld and has rough edges, which we list at the end. Source is on GitHub under GPLv2: https://github.com/Percona-Lab/ducksdb-mysql-engine.

Why we made it

MySQL is great for transactions and slow at analytics. A wide GROUP BY over a few hundred million rows, or a six-way join, takes minutes on InnoDB. The usual fix is to copy the data into a column store and keep it in sync, so now you’re running two systems and the pipeline between them.

We wanted the table itself to be the column store, with the heavy queries offloaded for you. Mark it ENGINE=DuckDB, query it the way you always have, and DuckDB does the analytical work.

What it actually is

DuckDB is an in-process columnar query engine, basically SQLite for OLAP. It stores data by column and it’s built for scans and aggregations, which is exactly what a row store is bad at.

We’re not the first to put it behind a relational table. Alibaba’s AliSQL has had a built-in DuckDB engine for a while. MariaDB shipped MariaDB DuckDB recently. AliSQL got there first; MariaDB and we landed on the same idea independently, around the same time, them on MariaDB and us on stock MySQL 9.7. Their engine is the closest comparison to ours, so it’s in the benchmarks below.

How it hooks into MySQL

MySQL doesn’t have a select_handler, the API MariaDB uses to grab a whole SELECT and run it inside an engine. We added our own: a handlerton::pushdown_select hook.

The pushdown path. Either the whole query renders to DuckDB SQL and runs columnar, or it declines and the normal row path handles it.

The engine is compiled into mysqld, and each schema is one DuckDB file under the datadir. Three patches do the integration, and all three are generic, so they’ll fire for any engine that exposes the hook:

The hook runs at the end of JOIN::optimize(). If every base table in the block is one engine that has the hook, that engine looks at the optimized JOIN, and if it can translate the whole query it sets JOIN::override_executor_func (which the executor already checks in sql_union.cc). The query gets regenerated as DuckDB SQL, prepared once, run, and the aggregated result is staged into a temp table. EXPLAIN is left alone.
A server-side LOAD DATA INFILE goes into a DuckDB COPY instead of crawling through write_row row by row. At 600M rows that’s a 20-minute load instead of 80.
For single-engine statements we clear OPTIMIZER_SWITCH_SEMIJOIN in prepare, so IN, EXISTS, NOT IN and NOT EXISTS stay as subqueries the builder can render instead of getting rewritten into semijoin nests it can’t recognize.

The builder only renders a node when the output is provably identical to what MySQL would return. If it can’t, it declines and MySQL runs the query unchanged. Literals are bound as parameters. Collation, NULL ordering and decimal scale are matched on purpose, and an unmapped collation or a REAL literal is enough to make it back off. With that in place, all 22 TPC-H queries push down and match InnoDB row for row.

Getting started

The fastest way in is the image:

  docker run -d --name mysql-duckdb -p 3306:3306 \
   -e MYSQL_ROOT_PASSWORD=secret \
   -v mysql-duckdb-data:/var/lib/mysql \
   perconalab/ducksdb-mysql-engine:9.7-duckdb-v0.2.2

Make a table, put a few rows in, run an aggregate:

CREATE DATABASE shop; USE shop;
CREATE TABLE sales (id INT PRIMARY KEY, region INT, amount DECIMAL(12,2)) ENGINE=DuckDB;
INSERT INTO sales VALUES (1,1,100),(2,1,200),(3,2,50);

SELECT region, SUM(amount) FROM sales GROUP BY region;
-- region | SUM(amount)
-- 1 | 300.00
-- 2 | 50.00

Nothing about that query is special, and that is the point. To check it actually went to DuckDB rather than down the row path, watch the Ducksdb_pushdown_count status variable:

SELECT region, SUM(amount) FROM sales GROUP BY region; -- offloaded
SHOW STATUS LIKE 'Ducksdb_pushdown_count'; -- counter goes +1

SELECT * FROM sales WHERE id = 3; -- point lookup
SHOW STATUS LIKE 'Ducksdb_pushdown_count'; -- counter unchanged

The single-row lookup stays on the row path deliberately. For one row an index seek beats spinning up a DuckDB result, so there is no reason to offload it. OLTP keeps its path, analytics get the column store, and you do not pick by hand.
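If you want to script that check, one option is to run the status query, the statement under test, and the status query again in a single client session, then subtract. The sketch below does the subtraction. The helper name pushdown_delta is hypothetical, and running everything in one session is a precaution, since the scope of the counter is not stated here.

```shell
# Sketch only: reads tab-separated `mysql -N -B` output on stdin and prints
# how far Ducksdb_pushdown_count moved between its first and last appearance.
pushdown_delta() {
    awk -F'\t' '
        $1 == "Ducksdb_pushdown_count" {
            if (!seen) { first = $2; seen = 1 }
            last = $2
        }
        END {
            if (!seen) exit 1      # counter never showed up in the input
            print last - first
        }'
}

# Example call (assumes the container from the docker command above;
# the client prompts for the root password):
#
# mysql -h 127.0.0.1 -P 3306 -u root -p -N -B shop -e "
#   SHOW STATUS LIKE 'Ducksdb_pushdown_count';
#   SELECT region, SUM(amount) FROM sales GROUP BY region;
#   SHOW STATUS LIKE 'Ducksdb_pushdown_count';" | pushdown_delta
```

A non-zero result means the statement in the middle was offloaded. Zero means it stayed on the row path.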

If you would rather build it, you need the MySQL 9.7 tree under vendor/mysql-server/ and a DuckDB prefix, then:

ln -s ../../engine vendor/mysql-server/storage/duckdb
scripts/build-server.sh # applies the 3 patches, builds mysqld + clients

Does it actually go fast?

All 22 TPC-H queries were executed in Docker on one laptop (20 cores, 62 GiB RAM), with the same data loaded into four engines: InnoDB, our MySQL+DuckDB, MariaDB+DuckDB, and standalone DuckDB as the reference. Times are warm wall-clock, the minimum over a few runs, in seconds.

SF10, around 60 million lineitem rows

A handful of rows here; the whole table is in docs/tpch_engine_comparison.md:

Query	InnoDB	MySQL+DuckDB	MariaDB+DuckDB	Native DuckDB
Q1	>180	1.77	0.84	0.77
Q5	127.6	0.53	0.46	0.71
Q7	145.0	0.45	0.38	0.67
Q9	>180	1.52	2.30	1.78
Q18	101.7	1.35	1.39	1.31
Q19	120.4	0.15	0.67	0.83
All 22	Finished 16/22	15.1	13.3	16.3

SF10, all 22 queries, log scale (lower is better). Hatched InnoDB bars did not finish inside 180 s. The three DuckDB engines sit in a tight band near the floor.

InnoDB is somewhere between 100 and 340 times slower per query, and on 6 of the 22 it never finished inside the 180-second cap (the correlated subqueries and the heaviest scans). It burned 1317 seconds on just the 16 it did finish. The three DuckDB engines get through all 22 in about 15 seconds, and ours lands right between MariaDB and plain DuckDB. The gap is so big there is not much else to say about it.

SF100, around 600 million lineitem rows

At this size InnoDB is out of the running (a copy of the data alone is about 100 GB and queries run for hours), so it is the three DuckDB engines only, run one at a time:

Query	MySQL+DuckDB	MariaDB+DuckDB	Native DuckDB
Q1	15.25	6.50	5.50
Q9	20.97	115.29	19.54
Q10	10.64	ERR	8.14
Q13	19.75	ERR	13.27
Q18	14.88	29.62	11.08
Q19	2.91	7.98	6.61
Correct	22/22	20/22	22/22

SF100, three DuckDB engines, log scale (lower is better). Q15 for our engine is shown at its matched-memory time (~4 s); the capped run measured 1309 s, explained below. MariaDB errored on Q10 and Q13.

At 600 million rows ours is still correct on all 22 and stays close to plain DuckDB. MariaDB’s engine drops two queries (Q10, and Q13 on its column-list syntax) and is a lot slower on the big joins – Q9 took 115 seconds against our 21 and native’s 20.

One honest word on Q15 at SF100, because in the full table it shows an ugly number for our engine. It is not a real loss. We capped DuckDB’s memory so it spills to disk instead of getting OOM-killed inside mysqld, and under that cap Q15’s CTE spills a lot. Give it the memory MariaDB had and it runs in about 4 seconds, like native. The answer was always right; only the clock was bad.

And one number we did not expect: loading those 600 million rows took about 20 minutes with our engine (the COPY shortcut) versus about 80 minutes with MariaDB, which loads row by row on a single core. Roughly four times faster to get the data in.

Try it, then tell us

If any of this sounds useful, pull the image and throw your own queries at it:

docker run -d -p 3306:3306 -e MYSQL_ROOT_PASSWORD=secret \
   perconalab/ducksdb-mysql-engine:9.7-duckdb-v0.2.2

The source is on GitHub (GPLv2), patches and benchmark harness included: https://github.com/Percona-Lab/ducksdb-mysql-engine. The full per-query benchmark and how we measured it live in docs/tpch_engine_comparison.md.

The post Running DuckDB as a MySQL 9.7 storage engine appeared first on Percona.

MySQL 9.x: Moving Away From SHA1 and MD5

Oracle MySQL Group — Thu, 09 Jul 2026 13:20:51 +0000

TL;DR If you use MD5(), SHA1(), or SHA() in MySQL today, start planning the move to SHA2(). Beginning with MySQL 9.6, MD5(), SHA1(), and SHA() are no longer native built-in SQL functions in the server binary. They are available through the Legacy Hashing Component: That component should be treated as a stopgap solution. It gives […]

Cross-site Disaster Recovery with Percona Operator for MySQL

Percona Community — Mon, 06 Jul 2026 10:00:00 +0000

A MySQL InnoDB Cluster provides high availability for a single database cluster using Group Replication. This works well for node failures inside the cluster, but disaster recovery usually requires another cluster in a separate location: another Kubernetes cluster, region, data center, or cloud.

This replica cluster needs to stay in sync with the primary, remain protected from accidental writes, and be ready to take over when you need to move traffic, either as a planned operation or during an outage.

InnoDB ClusterSet addresses this by linking multiple MySQL clusters into a single disaster-recovery topology. One cluster handles writes, while the others stay synchronized as read-only replicas.

Starting from v1.2.0, the Percona Operator for MySQL adds a new custom resource, PerconaServerMySQLClusterSet, which allows managing InnoDB ClusterSets. Creating the ClusterSet, adding replicas, switching the primary, and performing a forced failover are all handled declaratively by updating the Kubernetes spec and letting the operator reconcile the desired state.

This post explains how ClusterSet works, how to set it up with the Percona Operator, and how planned switchovers and emergency failovers work in practice.

Understanding InnoDB ClusterSet

Any disaster recovery design usually comes down to two important numbers:

Recovery Point Objective, or RPO, is how much data you can afford to lose. For example, an RPO of five seconds means the business can tolerate losing up to five seconds of writes.
Recovery Time Objective, or RTO, is how long the system can be unavailable before service must be restored.

The way you design and operate a ClusterSet directly affects both. To understand why, it helps to first look at the architecture.

An InnoDB ClusterSet is built from two or more InnoDB Clusters. Each InnoDB Cluster is a Group Replication group. In other words, it is the same kind of highly available MySQL cluster that the Percona Operator for MySQL can already deploy and manage.

A ClusterSet adds another layer on top of those clusters. One cluster is the primary cluster and accepts writes, while the others are replica clusters and remain read-only. The primary sends its changes to each replica using asynchronous replication over a dedicated replication channel.

This gives us two layers of replication, each solving a different problem.

Inside each cluster, Group Replication protects against the loss of individual MySQL nodes. Members are expected to be closer together, usually within the same region or availability zone group. Writes are coordinated by the group, which helps keep the local cluster consistent and highly available.

Between clusters, asynchronous replication protects against the loss of an entire site. Replica clusters can be located in another region, another Kubernetes cluster, or another cloud provider. Because this replication is asynchronous, long-distance network latency does not slow down writes on the primary cluster.

But the tradeoff here is that a replica cluster may be slightly behind the primary. The amount of lag depends on write volume, network latency, and the health of the replication channel. If the primary site is lost, any writes that had not yet reached the replica are lost, so that lag is the practical data-loss window during an emergency failover.
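Because lag is the data-loss window, it helps to compare it against the RPO target regularly. Here is a minimal shell sketch, assuming you capture the output of SHOW REPLICA STATUS\G on the replica cluster's primary. The helper name lag_within_rpo is hypothetical, and Seconds_Behind_Source is only an approximation of real lag.

```shell
# Sketch only: compare reported replication lag with an RPO target.
# lag_within_rpo RPO_SECONDS   (reads SHOW REPLICA STATUS\G text on stdin)
lag_within_rpo() {
    rpo=$1
    # Pick the first Seconds_Behind_Source value from the \G output.
    lag=$(awk -F': ' '/Seconds_Behind_Source/ { print $2; exit }')

    case $lag in
        ''|NULL)
            # Field missing, or the channel is not running.
            echo "lag unknown"
            return 2
            ;;
    esac

    if [ "$lag" -le "$rpo" ]; then
        echo "ok: ${lag}s <= ${rpo}s"
        return 0
    else
        echo "breach: ${lag}s > ${rpo}s"
        return 1
    fi
}
```

For the five-second RPO used as an example above, a reported lag of 3 passes and a lag of 12 is flagged. A NULL value is reported as unknown rather than treated as zero.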

Before building a ClusterSet with the operator, there are a few important requirements to keep in mind:

Every cluster in the ClusterSet must use the Group Replication topology. The operator also supports asynchronous replication with Orchestrator for standalone clusters, but that topology cannot be part of an InnoDB ClusterSet.
You need MySQL 8.0.27 or later.
Clusters are linked by network address, not by Kubernetes references. A replica cluster only needs to be reachable and managed by an operator. It does not need to live in the same Kubernetes cluster as the primary.

With the model in place, let’s build a simple cross-site disaster recovery setup.

Setting up ClusterSet

We’ll create the simplest useful ClusterSet: two Group Replication clusters named dc1 and dc2.

In this example, dc1 is the primary cluster and dc2 is the read-only replica cluster.

In a real deployment, these would usually run in separate Kubernetes clusters, regions, or cloud environments. The steps are mostly the same. The main requirement is that the endpoints listed in the ClusterSet spec must be routable between sites.

Creating a primary cluster

The primary cluster dc1 is a regular Group Replication cluster. There is nothing ClusterSet-specific about it at this stage.

yaml
apiVersion: ps.percona.com/v1
kind: PerconaServerMySQL
metadata:
  name: dc1
spec:
  mysql:
    clusterType: group-replication
    # ... the rest of a normal cluster spec

You can find a complete YAML here. Apply it and wait for it to come up, just as you would for any Percona Operator-managed MySQL cluster.

Creating the replica cluster

The replica cluster dc2 is also a Group Replication cluster, but with one important difference:

yaml
apiVersion: ps.percona.com/v1
kind: PerconaServerMySQL
metadata:
  name: dc2
spec:
  mysql:
    clusterType: group-replication
    bootstrap:
      mode: manual

Normally, when the operator creates a Group Replication cluster, the first MySQL pod bootstraps the group as soon as it starts. Subsequent pods then join that group.

For a ClusterSet replica, that is not what we want. We do not want dc2 to form an independent empty cluster. Instead, we want it to receive data from the primary cluster and then join the ClusterSet as a replica.

With bootstrap.mode: manual, the first pod starts but does not bootstrap its own Group Replication group. It waits until the ClusterSet process adopts it, clones data from the primary, and then forms the replica cluster. During this stage, the first dc2 pod may remain in a NotReady state until it is a part of the ClusterSet.

Sharing cluster credentials

The operator automatically creates a clusterset MySQL user in every cluster and stores its password in the cluster secret.

The operator uses this user to orchestrate ClusterSet operations, so the password must be the same across all clusters in the ClusterSet. When your clusters are deployed separately, copy the clusterset value from the primary cluster secret into the replica cluster secret before linking them.

For example, if dc1 is the primary, copy the clusterset password from the dc1 secret into the corresponding secret for dc2.
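When the two clusters live in different Kubernetes clusters, the copy can be scripted with kubectl. The sketch below only builds the merge patch; the commented commands show how it might be used. The secret name dc1-secrets and the key clusterset come from the ClusterSet spec in this post, while dc2-secrets, the kubectl context names, and the helper name secret_patch are assumptions.

```shell
# Sketch only: build a merge patch that sets data.clusterset in a Secret.
# secret_patch B64_VALUE   (value must already be base64, as stored in .data)
secret_patch() {
    printf '{"data":{"clusterset":"%s"}}\n' "$1"
}

# Example use (context and replica secret names are hypothetical):
#
# pw=$(kubectl --context dc1 get secret dc1-secrets \
#        -o jsonpath='{.data.clusterset}')
# kubectl --context dc2 patch secret dc2-secrets --type=merge \
#        -p "$(secret_patch "$pw")"
```

Reading from .data keeps the value base64-encoded end to end, so the password never appears in clear text in the shell.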

Linking the clusters

Once both clusters are applied, create a PerconaServerMySQLClusterSet custom resource.

yaml
apiVersion: ps.percona.com/v1
kind: PerconaServerMySQLClusterSet
metadata:
  name: my-cluster-set
  finalizers:
  - percona.com/clusterset-dissolve
spec:
  primaryCluster: dc1
  credentialsSecret:
    name: dc1-secrets
    key: clusterset
  sslMode: AUTO
  createReplicaClusterOptions:
    recoveryMethod: clone
  clusters:
  - innodbClusterName: dc1
    endpoints:
    - host: dc1-mysql-primary.default.svc.cluster.local
  - innodbClusterName: dc2
    endpoints:
    - host: dc2-mysql-0.dc2-mysql.default.svc.cluster.local
  mysqlshellRunner:
    image: perconalab/percona-server-mysql-operator:main-psmysql8.4

The most important fields are:

primaryCluster defines which cluster currently accepts writes. The value must match one of the entries under clusters.
clusters lists every member of the ClusterSet and the endpoint the operator should use to reach it. These endpoints are plain network addresses, which is what allows members to run in different Kubernetes clusters or regions.
credentialsSecret points to the secret that contains the clusterset user password.
recoveryMethod: clone tells the replica cluster to take a full copy of the primary data when it joins the ClusterSet. The alternative is an incremental recovery method, which uses existing binary logs instead of cloning the full dataset.
mysqlshellRunner defines the helper pod image used by the operator to run MySQL Shell operations.

After you apply this resource, the operator starts a MySQL Shell runner pod and creates the ClusterSet on dc1. It then joins dc2, which clones the data, starts replication, and brings up the remaining pods in the replica cluster.

At this point, dc1 serves reads and writes, while dc2 acts as a live read-only copy.

Seeding large replica clusters

In this example, the replica cluster is created with recoveryMethod: clone, so MySQL Shell provisions the first replica member by copying a physical snapshot from an existing ClusterSet member. That is convenient for small and medium datasets, but it can be fragile across WAN links or with very large databases.

A full clone can take hours, consume significant bandwidth, add load to the donor, run into network interruptions, and become expensive to retry if the operation fails partway through. It may also be a poor fit when the primary is busy or when cross-region egress cost is a concern.

The operator makes it possible to seed the replica cluster from an existing backup of the primary cluster instead. Create a PerconaServerMySQLBackup on the primary, restore that backup into the replica cluster with PerconaServerMySQLRestore, and then add the replica to the ClusterSet using recoveryMethod: incremental. You can find the exact restore procedure in the documentation.

At that point, the replica already has the primary’s data and GTID history, so ClusterSet only needs to catch it up from the primary’s binary logs instead of transferring the full dataset again.

Verifying it worked

The simplest way to confirm that the ClusterSet is working is to write data to the primary cluster and read it from the replica.

For example:

Connect to dc1.
Create a test table or insert a row.
Connect to dc2.
Confirm that the same data appears there.

If the row appears on dc2, the asynchronous replication channel is running and the replica cluster is receiving changes from the primary.
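The four steps above can be sketched as two small SQL generators piped into the mysql client. The schema name dr_check, the table name probe, and the helper names are hypothetical. The table has a primary key because Group Replication requires one.

```shell
# Sketch only: emit the SQL for a write-then-read replication probe.

# write_probe_sql TOKEN: statements to run on dc1 (the primary).
write_probe_sql() {
    printf 'CREATE DATABASE IF NOT EXISTS dr_check;\n'
    printf 'CREATE TABLE IF NOT EXISTS dr_check.probe (token VARCHAR(64) PRIMARY KEY);\n'
    printf "INSERT INTO dr_check.probe VALUES ('%s');\n" "$1"
}

# read_probe_sql TOKEN: query to run on dc2 (the replica).
# It returns 1 once the row has replicated, 0 before that.
read_probe_sql() {
    printf "SELECT COUNT(*) FROM dr_check.probe WHERE token = '%s';\n" "$1"
}

# Example use, with the endpoints from the ClusterSet spec:
#
# token="probe-$(date +%s)"
# write_probe_sql "$token" | mysql -h dc1-mysql-primary.default.svc.cluster.local -u root -p
# read_probe_sql "$token"  | mysql -h dc2-mysql-0.dc2-mysql.default.svc.cluster.local -u root -p -N
```

Using a fresh token per run avoids mistaking an old row for proof that replication is currently flowing.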

Planned Switchover

A planned switchover is used when both clusters are healthy and you intentionally want to move writes from one site to another. This is useful for regional maintenance, Kubernetes cluster upgrades, cloud migrations, or controlled DR testing.

To move the primary role from dc1 to dc2, update the primaryCluster field:

shell
kubectl patch ps-clusterset my-cluster-set --type=merge \
 -p '{"spec":{"primaryCluster":"dc2"}}'

The operator notices that the desired primary cluster no longer matches the current primary. It then uses MySQL Shell to perform a clean switchover.

Because both clusters are available, the operator can make sure the replica has caught up before changing roles. After the switchover completes, dc2 becomes the writable primary and dc1 becomes a read-only replica.

Emergency Failover

An emergency failover can be used when the primary cluster is unreachable and a clean handover is no longer possible.

This is the disaster recovery case: the Kubernetes cluster, region, or network path to the primary may be down, and you need to promote a surviving replica so the application can resume writes.

To fail over to dc2, update primaryCluster and explicitly set the forced failover flag:

shell
kubectl patch ps-clusterset my-cluster-set --type=merge \
 -p '{"spec":{"primaryCluster":"dc2","unsafeFlags":{"forcedFailover":true}}}'

The operator only follows this path when it can confirm that the current primary cluster is unreachable. It then promotes dc2, allowing it to accept writes.

The explicit flag is important because failover can cause data loss. Replication between clusters is asynchronous, so any writes that reached the old primary but had not yet replicated to dc2 are not present on the new primary. Once dc2 is promoted, those missing writes become unrecoverable through normal ClusterSet recovery.

The risk of data loss is why the field is named unsafeFlags.forcedFailover.

Another important point is that when the old primary comes back, it does not automatically resume as primary. After a forced failover, the recovered cluster must be explicitly reintroduced into the ClusterSet as a replica.
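The only difference between the switchover and failover patches is the unsafe flag, so a runbook can keep both behind one helper and make the forced path an explicit choice. Here is a minimal sketch; the function name primary_patch and its force argument are hypothetical, while the JSON shapes are the ones shown above.

```shell
# Sketch only: build the merge patch for a switchover or a forced failover.
# primary_patch TARGET_CLUSTER [force]
primary_patch() {
    target=$1
    if [ "${2:-}" = force ]; then
        # Emergency path: may lose writes that never reached the target.
        printf '{"spec":{"primaryCluster":"%s","unsafeFlags":{"forcedFailover":true}}}\n' "$target"
    else
        # Planned path: both clusters healthy, clean role change.
        printf '{"spec":{"primaryCluster":"%s"}}\n' "$target"
    fi
}

# Example use:
#
# kubectl patch ps-clusterset my-cluster-set --type=merge -p "$(primary_patch dc2)"
# kubectl patch ps-clusterset my-cluster-set --type=merge -p "$(primary_patch dc2 force)"
```

Requiring the literal word force keeps the data-loss path from being taken by accident.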

Adding and removing clusters

Adding or removing clusters follows the same declarative pattern: update the custom resource spec and let the operator reconcile the difference.

To add another replica cluster, add a new entry under clusters:

yaml
apiVersion: ps.percona.com/v1
kind: PerconaServerMySQLClusterSet
metadata:
  name: my-cluster-set
spec:
  # .. existing spec
  clusters:
  # .. existing clusters
  - innodbClusterName: dc3
    endpoints:
    - host: dc3-mysql-primary.default.svc.cluster.local

The operator joins the new cluster in the same way it joined dc2: it clones data from the primary, configures replication, and brings the cluster into the ClusterSet as a read-only replica.

To remove a cluster, delete its entry from the clusters list. You can update your manifest and reapply it, or use a JSON patch:

shell
kubectl patch ps-clusterset my-cluster-set --type=json \
 -p '[{"op":"remove","path":"/spec/clusters/1"}]'

If the cluster is healthy, the operator detaches it cleanly and it becomes a normal standalone cluster again.

If the cluster being removed is unreachable, you can force its removal:

shell
kubectl patch ps-clusterset my-cluster-set --type=json -p '[
 {"op":"remove","path":"/spec/clusters/1"},
 {"op":"add","path":"/spec/unsafeFlags/forcedClusterRemoval","value":true}
]'

Like forced failover, forced removal is gated behind an unsafe flag because the operator should not make this decision silently. Removing an unreachable cluster from a ClusterSet is an operational decision with consequences, and it should be made explicitly.
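Both removal patches address the cluster by its position in the list (/spec/clusters/1), which is easy to get wrong once the list grows. A small sketch for looking the index up by name first; the helper name cluster_index is hypothetical, and the kubectl jsonpath call in the comment is one possible way to list the names.

```shell
# Sketch only: find the zero-based position of a cluster name.
# cluster_index NAME   (reads one innodbClusterName per line on stdin)
cluster_index() {
    awk -v name="$1" '
        $0 == name { print NR - 1; found = 1; exit }
        END { if (!found) exit 1 }'
}

# Example use:
#
# idx=$(kubectl get ps-clusterset my-cluster-set \
#         -o jsonpath='{range .spec.clusters[*]}{.innodbClusterName}{"\n"}{end}' \
#       | cluster_index dc2) || exit 1
# kubectl patch ps-clusterset my-cluster-set --type=json \
#   -p "[{\"op\":\"remove\",\"path\":\"/spec/clusters/$idx\"}]"
```

If the name is not in the list, the helper prints nothing and returns a non-zero status, so the patch is never sent with a wrong index.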

Wrapping up

The Percona Operator for MySQL extends Group Replication beyond a single site by managing InnoDB ClusterSet through the PerconaServerMySQLClusterSet custom resource. A primary cluster handles writes, replica clusters stay synchronized, and the operator manages switchovers, failovers, and membership changes declaratively.

For planned maintenance, switchover moves the primary role safely with no data loss. For outages, forced failover promotes a surviving replica, with the expected risk of losing any writes that had not yet replicated. That replication lag is the practical RPO, so it should be monitored and tested as part of the DR plan.

With the Percona Operator for MySQL, disaster recovery becomes repeatable, Kubernetes-native, and easier to operate across regions or clusters.

MySQL Community Server 26.7 Early Access Release

Oracle MySQL Group — Fri, 03 Jul 2026 16:58:15 +0000

MySQL 26.7 is the initial MySQL Innovation release following the MySQL 9.7 LTS release and uses the new yy.mm CalVer versioning model for quarterly Innovation releases. This Early Access release provides a preview of selected functionality planned for the MySQL Community Server package and gives users an opportunity to evaluate upcoming changes before general availability. Download MySQL […]

Still on MySQL 5.7 or 8.0? Those high-severity CVE fixes are covered

MySQL Performance Blog — Thu, 02 Jul 2026 08:01:50 +0000

Upstream MySQL published an out-of-schedule release this week with two high-severity CVE fixes. If you’re running Percona Server for MySQL 5.7 or 8.0 under Extended Lifecycle Support (ELS), the program we previously called Post EOL Support, you don’t have to do anything to qualify for them. We’ve already applied the fixes and re-released the affected ELS builds.

This is the point of ELS. When a major version reaches End of Life (EOL), the community stops shipping patches, but the databases running on it don’t stop mattering. ELS keeps critical bug and security fixes coming for versions that are past their EOL date, so you can stay on 5.7 or 8.0 on your own timeline instead of a deadline someone else set.

What we did

These CVE fixes landed upstream outside the normal cadence. Under ELS, customers are entitled to security fixes for the versions they run, so we pulled the patches into the 5.7 and 8.0 builds and re-released them. ELS customers will get access to the updated builds from the usual private repository in the next couple of weeks.

Why this matters if you’re still on 5.7 or 8.0

Percona Server for MySQL 5.7 reached EOL in October 2023. Percona Server for MySQL 8.0 reached EOL in April 2026. Plenty of production systems are still on both, and not every migration can happen on the upstream’s schedule. Running an unpatched database past EOL is where the real risk sits: no security fixes, no bug fixes, and no support when something breaks at 2:00 a.m.

ELS closes that gap. You keep getting the critical fixes, including out-of-schedule security patches like these, while you plan an upgrade on terms that work for your team.

Where to go from here

If you’re on 5.7 or 8.0 and don’t have ELS in place, now is a good time to look at it. The fixes we just shipped are exactly what the program is for. See the details for your version: Extended Lifecycle Support for MySQL 8.0 or Extended Lifecycle Support for MySQL 5.7. Or reach out via percona.com or the Percona Community Forum to discuss coverage for your environment.

Written by @Dennis Kittrell – Reviewed by @Matthew Boehm & @Varun Nagaraju

The post Still on MySQL 5.7 or 8.0? Those high-severity CVE fixes are covered appeared first on Percona.

MySQL & MySQL HeatWave Report – June 2026

Olivier Dasini — Wed, 01 Jul 2026 13:05:59 +0000

Keeping up with the MySQL ecosystem is becoming increasingly challenging. Every release introduces new features, performance improvements, security enhancements, and cloud capabilities. While the official documentation is comprehensive, it is not always easy to quickly identify what really matters.

To help with that, I've published a new edition of my MySQL & MySQL HeatWave Report, covering the most important announcements around MySQL 9.7 LTS and MySQL HeatWave 9.7.
Slides: https://speakerdeck.com/freshdaz/mysql-and-mysql-heatwave-report-june-2026

The post MySQL & MySQL HeatWave Report – June 2026 first appeared on Data Daz (dasini.net) - Data Systems, AI, and Real-World Insights.

Skipping Percona Server for MySQL 8.4.9 and 9.7.0

MySQL Performance Blog — Mon, 29 Jun 2026 15:19:11 +0000

Update, July 1, 2026: Percona Server for MySQL 8.4.10-10 is now available. It carries the content originally planned for 8.4.9 plus the upstream security fixes. See the 8.4.10-10 release notes. 9.7.1 is still on the way; we’ll link its release notes here when it ships.

Upstream MySQL published an out-of-schedule release this week with two high-severity CVE fixes. We’ve pulled those fixes into our next builds and are skipping the two versions we had already queued: Percona Server for MySQL 8.4.9 and 9.7.0.

These fixes arrived through Oracle’s new monthly Critical Security Patch Updates (CSPUs), which Oracle announced would begin on May 28, 2026. CSPUs ship targeted high-severity fixes between Oracle’s quarterly Critical Patch Updates. For MySQL, these updates are issued as needed rather than on a fixed monthly schedule, so out-of-schedule security fixes like these may become more common.

We’ve handled a skip like this before. When MySQL Community Server 8.4.2 followed 8.4.1 by only a few weeks, we skipped 8.4.1 and shipped its contents in 8.4.2-2. This is the same approach.

What’s happening

The code for 8.4.9 and 9.7.0 was ready for packaging when the CVE fixes landed. Rather than ship those builds and follow immediately with a security patch, we applied the fixes, re-tested, and re-tagged. Percona Server for MySQL 8.4.10 and 9.7.1 will carry everything 8.4.9 and 9.7.0 would have contained, plus the upstream high-severity CVE fixes.

These fixes come from Oracle’s June 2026 Critical Security Patch Update; the specific CVE identifiers will be listed in the 8.4.10 and 9.7.1 release notes. No action is required on your part. The fixes reach you in 8.4.10 and 9.7.1, expected within days. If your security policy requires faster remediation, contact Percona Support to discuss interim options.

8.4.9 and 9.7.0 will not appear in the package repositories. A normal upgrade moves you straight to 8.4.10 or 9.7.1, which carry the skipped versions’ content.

Who this affects

If you were waiting specifically for 8.4.9 or 9.7.0, those versions won’t be published. Point your upgrade at the next releases instead, which include the same content and the CVE fixes. The delay is a few days, not weeks. If you weren’t tracking a specific version number, nothing changes for you.
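If your automation pins an exact version, the remapping described above can be sketched in a few lines of Python. This is only an illustration: the `resolve_target` helper and the mapping table are hypothetical names, not part of any Percona tooling, and the table covers only the two skipped versions named in this post. Replacement build suffixes are left out because the sketch has no way to know them.

```python
# Hypothetical helper: remap a pinned Percona Server for MySQL version
# to the release that replaces it. The table lists only the two skipped
# versions named in this post; it is an assumption, not an official feed.
SKIPPED_TO_REPLACEMENT = {
    "8.4.9": "8.4.10",
    "9.7.0": "9.7.1",
}


def resolve_target(pinned: str) -> str:
    """Return the version to upgrade to for a pinned version string.

    Accepts a base version ("8.4.9") or one with a build suffix
    ("8.4.9-9"). Only the base part is used for the lookup. Versions
    that were not skipped are returned unchanged.
    """
    base = pinned.split("-", 1)[0]
    return SKIPPED_TO_REPLACEMENT.get(base, pinned)


if __name__ == "__main__":
    for pinned in ("8.4.9", "9.7.0", "8.4.8"):
        print(f"{pinned} -> {resolve_target(pinned)}")
```

A pin on a version that was never skipped passes through untouched, so the helper can sit in front of an existing upgrade step without changing its behavior for anyone who was not tracking 8.4.9 or 9.7.0.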

What to do

Nothing urgent. Upgrade to the next Percona Server for MySQL releases as you normally would once they’re published. We’ll announce them through release notes and the Percona Blog. For questions about timing or the security content, reach out to Percona Support or post in the Percona Community Forum.

What to expect going forward

Because Oracle can now issue CSPUs between its quarterly updates, out-of-schedule fixes are likely to happen more often. Our approach stays consistent: we evaluate every upstream release, and when high-severity fixes land between our scheduled releases, we fold them into the next release rather than shipping a separate build for each one. Your LTS support commitments don’t change. We’re watching how often Oracle uses the monthly cadence and will adjust release planning if the volume warrants it.

The post Skipping Percona Server for MySQL 8.4.9 and 9.7.0 appeared first on Percona.

Continuing the Conversation: MySQL Community Engagement Across JAPAC

Oracle MySQL Group — Mon, 29 Jun 2026 02:57:35 +0000

One of the key themes of the MySQL Community over the past year has been increasing transparency, participation, and collaboration. Through Public Discussions, Design Proposals, the MySQL Developer Guide, GitHub collaboration, and the MySQL Contributor Summit, we have been working to create more opportunities for the community to engage with the future direction of MySQL. […]

What is `--sync-thread-lock-mode`?