ovais.tariq - tech and leadership

Monitoring ProxySQL using Datadog

Ovais Tariq — Thu, 19 Jan 2017 15:34:20 GMT

ProxySQL is a high performance proxy for MySQL and its forks. One of the key features is its ability to handle hundreds of thousands of connections with very low overhead. Datadog is a monitoring service for cloud-scale applications, bringing together data from servers, databases, tools, and services to present a unified view of an entire stack.

Datadog does not yet provide an integration for ProxySQL. So I decided to write an integration by forking the Datadog agent. Read my detailed blog post on TwinDB Blog to learn how to use the ProxySQL-Datadog integration.

Slides of my talk on Monitoring MySQL at Scale

Ovais Tariq — Wed, 08 Jun 2016 07:00:00 GMT

The slides of my talk on best practices for monitoring large-scale MySQL deployments are now available for download. The talk was presented at Percona Live 2016.

Monitoring MySQL at scale from Ovais Tariq

Extend MySQL Master HA (MHA) capabilities with MHA Helper

Ovais Tariq — Mon, 01 Feb 2016 13:58:10 GMT

I have used many tools, starting with MMM, to manage MySQL replication clusters. Some of these tools depend on additional, complex HA software such as Pacemaker and Corosync, or ZooKeeper. Others, MMM being one example, do not handle failover well and leave the slaves in an inconsistent state.
Of all the tools, I love MySQL Master HA (MHA) the most. MHA is a great tool for managing MySQL replication clusters for the purpose of HA. The most important thing about MHA is that it tries to take all the necessary steps to do a MySQL master failover in a way that provides as much data consistency as possible. The slave promotion also tends to be very quick; on average I have seen it take 10 to 15 seconds. It is also very easy to deploy, unlike some of the other complex HA solutions.

I would highly recommend reading about the architecture of MHA on its wiki: https://code.google.com/p/mysql-master-ha/wiki/Architecture

Why MHA Helper?

MHA does one job and it does it well. It handles slave promotion in the best possible way. However, slave promotion, whether after a master failure or during planned maintenance, is only one step in the process. Typically a lot more is involved after a slave promotion has happened. For example, the application needs to be notified that the master has changed and that it needs to write to a new master. Then there may be other operational tasks that need to be performed, such as notifying the monitoring service that the master has changed, or notifying the configuration management service so that it writes the correct configuration for a master. There can be a host of other operations that may need to be performed; I have just given a few examples that are most likely to be needed in the majority of cases.
That’s where MHA Helper comes in. MHA Helper provides a pluggable interface that extends MHA such that additional tasks may be performed in case of a MySQL failover.

Virtual IP management using MHA Helper

Currently MHA Helper provides virtual IP management for MySQL replication clusters. A Virtual IP based HA solution is the most common implementation I have seen across a large number of MySQL users. Below is a simple illustration that shows a Virtual IP being used by the app to connect to MySQL.

Now when a failover happens the Virtual IP gets moved to the promoted master and the apps disconnect and reconnect using the same Virtual IP.

Now where does MHA Helper fit in here? MHA Helper acts as an external plugin that MHA invokes during various stages of the failover. MHA Helper then takes care of the necessary pre- and post-failover steps, depending on the type of the failover. These steps include:

MySQL read_only flag handling. This also includes supporting the super_read_only flag that is available in Percona Server and other variants of MySQL
Handling of Virtual IP failover
Handling MySQL connections termination to make failover fast
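
The three responsibilities above can be pictured as an ordered plan of steps. What follows is only a rough sketch of one plausible ordering, not MHA Helper's actual code: the function name, the step descriptions and the ordering itself are assumptions made for illustration. It distinguishes a planned switch (old master still reachable) from a hard failure (old master gone).

```python
def plan_failover(old_master, new_master, old_master_reachable):
    """Return an ordered list of (host, action) steps for a toy VIP failover.

    Hypothetical helper: the ordering is an illustrative assumption,
    not something taken from MHA or MHA Helper.
    """
    steps = []
    if old_master_reachable:
        # Planned switch: stop writes on the old master before moving the VIP.
        steps.append((old_master, "set read_only=1"))
        steps.append((old_master, "remove virtual IP"))
        # Terminating connections forces the apps to reconnect quickly.
        steps.append((old_master, "kill client connections"))
    # In both cases the promoted master must accept writes and own the VIP.
    steps.append((new_master, "set read_only=0"))
    steps.append((new_master, "add virtual IP"))
    return steps


if __name__ == "__main__":
    for host, action in plan_failover("db1", "db2", old_master_reachable=True):
        print(f"{host}: {action}")
```

When the old master is unreachable, only the two steps on the promoted master remain, which is why the type of failover matters.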

Pluggable and Extensible Architecture of MHA Helper

MHA Helper has been designed to be pluggable and extensible. It is configured through ini-style configuration files, with one file per MySQL replication cluster, meaning that different replication clusters can have totally different configurations without affecting each other. A complete list of configuration options with examples can be seen on the GitHub page: https://github.com/ovaistariq/mha-helper#configuration
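
To make the "one ini file per cluster" idea concrete, here is a minimal sketch using Python's standard configparser module. The section name and the option names (writer_vip, kill_connections) are made up for illustration and are not MHA Helper's real options; the real ones are listed on the GitHub page above.

```python
import configparser

# Hypothetical per-cluster configurations. The section and option names are
# illustrative assumptions, NOT the real MHA Helper option names.
CLUSTER_A = """
[cluster]
writer_vip = 192.0.2.10
kill_connections = yes
"""

CLUSTER_B = """
[cluster]
writer_vip = 192.0.2.20
kill_connections = no
"""


def load_cluster_config(text):
    """Parse one cluster's ini text into a plain dictionary."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["cluster"]
    return {
        "writer_vip": section["writer_vip"],
        # getboolean understands yes/no, true/false, on/off and 1/0.
        "kill_connections": section.getboolean("kill_connections"),
    }


if __name__ == "__main__":
    print(load_cluster_config(CLUSTER_A))
    print(load_cluster_config(CLUSTER_B))
```

Because each cluster is parsed from its own file, changing one cluster's settings cannot leak into another's.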

MHA Helper currently only has the Virtual IP based failover implementation; however, it supports adding different implementations, for example one for MySQL instances running in AWS or in other types of clouds. Because of the pluggable architecture of MHA Helper, it is easy to add more integrations to it. For example, Datadog and Chef integrations are currently in the works. The idea with the Datadog integration is to send an event to Datadog whenever a failover is performed, as well as to set appropriate tags on the new master to keep the metrics collection up to date.

If you have any more integration ideas, please file an issue at https://github.com/ovaistariq/mha-helper/issues

Installation

Installing MHA Helper is extremely easy. All you need to do is set up the yum repository:

curl -s https://packagecloud.io/install/repositories/twindb/main/script.rpm.sh | sudo bash

And then install the packages:

yum install mha4mysql-manager mha4mysql-node python-mha_helper

And that’s it!

For detailed instructions on configuration and installation, visit the repository page on GitHub: https://github.com/ovaistariq/mha-helper. And don’t forget to file bugs or feature requests.

Chef multipath cookbook version 0.0.9 now available

Ovais Tariq — Mon, 01 Jun 2015 16:17:57 GMT

I have released version 0.0.9 of Chef multipath cookbook. The cookbook now supports Pure Storage SAN among a bunch of other improvements.

Below is the list of changes and improvements in version 0.0.9 of Chef multipath cookbook:

Added support for Pure Storage SAN
Added test-kitchen tests
Reimplemented the chefspec tests
Fixed the issues reported by foodcritic

The cookbook is now tested to work with Chef 12.

To configure multipath for Pure Storage LUNs, all you need to do is set the following attribute:
node["multipath"]["storage_type"] = "purestorage"

Feel free to contribute in the form of pull requests and bug reports. The repository is available at https://github.com/ovaistariq/cookbook-multipath

Beware of MySQL BLOB Corruption in Older Versions

Ovais Tariq — Wed, 24 Dec 2014 12:06:41 GMT

Does your dataset consist of InnoDB tables with large BLOB data such that the data is stored in external BLOB pages? Was the dataset created in MySQL version 5.1 or below without using the InnoDB plugin, or with the InnoDB plugin but with a MySQL version earlier than 5.1.55? If the answer to both questions is “YES”, then it could very well be that you have hidden corruption lying around in your dataset. The only way you would find out about the corruption is when you have a crash with InnoDB assertion messages similar to the following:
InnoDB: Serious error! InnoDB is trying to free page 4
InnoDB: though it is already marked as free in the tablespace!
InnoDB: The tablespace free space info is corrupt.

In this post I will summarize what the bug is and how it corrupts the dataset. If you want more details of why and how the corruption manifests itself then you can additionally read the following bug reports:
http://bugs.mysql.com/bug.php?id=55543
http://bugs.mysql.com/bug.php?id=55284
http://bugs.mysql.com/bug.php?id=55981

MySQL BLOB Corruption

The bug involves records that contain BLOBs that are stored off-page on external pages. When BLOBs are stored off-page depends on the size of the BLOB and the row format, and sometimes also on the size of the rows. Since we are mainly talking about older MySQL versions, we only need to deal with the COMPACT and REDUNDANT row formats. Basically, InnoDB tries to store the entire BLOB on the same InnoDB page, but if the row is so large that at least two rows cannot be stored on the page, then InnoDB stores the first 768 bytes of the BLOB on the page where the row is stored and the rest of the BLOB data on external pages. The page that contains the row then contains a pointer to the external page, so that when InnoDB reads the row it can look up the BLOB by following the pointer.
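
The on-page versus off-page split can be sketched with a few lines of arithmetic. This is a simplified model and not InnoDB's real algorithm: it assumes a 16KB page, approximates "at least two rows must fit on a page" as "a row may use at most half a page", and considers a row with a single BLOB column. The function and parameter names are hypothetical.

```python
PAGE_SIZE = 16 * 1024      # assumed page size of 16KB
LOCAL_PREFIX = 768         # bytes kept on-page for COMPACT/REDUNDANT rows


def split_blob(other_columns_bytes, blob_bytes, page_size=PAGE_SIZE):
    """Return (on_page_bytes, off_page_bytes) for a single BLOB column.

    Simplification: a row is assumed to fit on-page when it uses at most
    half of the page, which approximates the two-rows-per-page rule.
    """
    max_row_bytes = page_size // 2
    if other_columns_bytes + blob_bytes <= max_row_bytes:
        # The whole row fits, so the BLOB stays inline.
        return blob_bytes, 0
    # Otherwise only the prefix stays with the row; the rest goes external.
    on_page = min(LOCAL_PREFIX, blob_bytes)
    return on_page, blob_bytes - on_page


if __name__ == "__main__":
    print(split_blob(100, 1000))
    print(split_blob(100, 20000))
```

In this model a 1000-byte BLOB stays entirely on the page, while a 20000-byte BLOB keeps only its 768-byte prefix with the row.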

Now coming back to the bug. The bug is caused by certain parts of the InnoDB code that cause the external page holding the BLOB data to get orphaned.
One way the page gets orphaned is when the PRIMARY key column is updated but the update is rolled back for some reason. Updating the PRIMARY key column requires a delete of the old row and the creation of a new row with the new data. It also requires moving the pointer to the external BLOB page from the old row to the new row. This move of the association between the row and the external page was not being done in a consistent and transaction-safe way, which would cause the association to get lost in case of a transaction rollback.

What was really happening can be seen as a sequence of the below events:

a. A transaction modifies the PRIMARY key column
b. Modification of PRIMARY key column causes the creation of a new row with the new data and old row being delete-marked
c. The association between the old row and the external BLOB page is changed, such that the external BLOB page is now associated with the new row.
d. The transaction is rolled back. The rollback undoes the changes, but does not change the association of the external BLOB page back to the old row.
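
The sequence of events above can be replayed on a toy model. The sketch below is purely illustrative and has nothing to do with InnoDB's actual code: ToyTablespace and simulate_buggy_rollback are invented names, and the model only tracks which pages are free and which row "owns" a page.

```python
class ToyTablespace:
    """Tracks free pages and page ownership; a toy, not InnoDB."""

    def __init__(self):
        self.free_pages = set()
        self.owner = {}

    def allocate(self, page_no, row):
        self.free_pages.discard(page_no)
        self.owner[page_no] = row

    def free(self, page_no):
        if page_no in self.free_pages:
            raise RuntimeError(
                f"trying to free page {page_no} though it is already marked as free"
            )
        self.free_pages.add(page_no)
        self.owner.pop(page_no, None)


def simulate_buggy_rollback():
    ts = ToyTablespace()
    pointers = {"old_row": 4}          # row -> external BLOB page number
    ts.allocate(4, "old_row")
    # (a)+(b): the PRIMARY key update creates new_row, old_row is delete-marked.
    # (c): ownership of the external page moves to the new row.
    ts.owner[4] = "new_row"
    # (d): rollback discards new_row and frees its page, but ownership is
    # never handed back to old_row, which still holds a pointer to page 4.
    ts.free(4)
    # Later the free page is reused for another row and freed again.
    ts.allocate(4, "other_row")
    ts.free(4)
    # Deleting old_row now tries to free page 4 a second time.
    try:
        ts.free(pointers.pop("old_row"))
    except RuntimeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(simulate_buggy_rollback())
```

Nothing goes wrong at rollback time in this model; the error only surfaces when the original row is finally deleted.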

Hence, a transaction rollback would cause the BLOB page to get orphaned and freed. Once the page is freed it can be reused by InnoDB for storing BLOB data for other rows. The important thing to note here is that the original row that we tried to modify (but rolled back the changes) still contains a pointer to the external BLOB page. The thing that is lost is the association from InnoDB’s perspective and that is why it considers the page free to be reused.

Now, this issue does not immediately cause any crashes. So you would never really know. A crash would only happen when the row above that we tried to modify (but rolled back the change) is deleted. Marko summarizes when this bug would cause a crash as follows:
“Furthermore, if your tablespace has been created and modified with an old (buggy) version of InnoDB, a BLOB page could be freed and reused for something else. InnoDB would read and deliver the overwritten contents of the BLOB page to queries. The problem would not be detected until you actually delete the row and both the original BLOB page and the reused BLOB page have been freed.”

Is there any other way to test for this kind of corruption? No, there is not. Neither “CHECK TABLE” nor innochecksum validates pointers to BLOB pages in any way.
Below is an excerpt from the MySQL manual:
“CHECK TABLE surveys the index page structure, then surveys each key entry. It does not validate the key pointer to a clustered record or follow the path for BLOB pointers.”

Mitigation

So what is the safest path then? The safest thing to do is to rebuild the dataset by means of a dump and reload. We mostly have slaves and secondary masters created using some form of filesystem copy, either a direct copy or a tool such as XtraBackup. If a crash does happen due to this bug, it would most likely happen after a DELETE gets executed. That DELETE would replicate to the slaves as well, and any slave created using some form of filesystem copy carries the same hidden corruption and would crash too. A crash of the whole replication hierarchy would be a disaster.

Another thing to be careful about is running tools that DELETE rows containing large BLOB values. For example, if you run pt-table-sync on a table with hidden corruption and pt-table-sync has to sync records, then there is a chance that this could cause a crash. This is because by default pt-table-sync executes REPLACE statements, which are internally mapped to DELETE+INSERT.

Conclusion

So to summarize, if your dataset was created such that it meets either of the following conditions:
a. the dataset was created under MySQL version <= 5.1 without using the InnoDB plugin, or
b. the dataset was created under MySQL version < 5.1.55 with the InnoDB plugin,
then it is a very good idea to rebuild the dataset by means of a dump and reload. The bugs were only fixed in 5.1.55 and above for the InnoDB plugin.

Nasty MySQL Replication Bugs that Affect Upgrade to 5.6

Ovais Tariq — Tue, 25 Nov 2014 13:52:51 GMT

There were two nasty MySQL replication bugs in two different 5.6 releases that would make it difficult to upgrade slaves to MySQL 5.6 while they are still connected to a MySQL 5.5 master.

The first of those bugs is MySQL bug 72610, which affects 5.6.19. Essentially this bug is triggered when the table structure on the slave is different from the table structure on the master, which leads to an unnecessarily large amount of RAM being used while replicating events that affect that table. The amount of RAM used is generally more noticeable when the replicated transaction consists of thousands of RBR events.

The most common way this affects the upgrade of a replication hierarchy is when the master runs MySQL 5.5, the slave runs MySQL 5.6, and there are transactions involving DATETIME column(s). Tables with DATETIME columns have a different underlying structure when created on MySQL 5.5 versus MySQL 5.6. Ideally you would avoid creating a new table with temporal columns while the master and slave are on different MySQL major versions. You would also want to avoid running statements that ALTER the structure of a table with temporal columns, such as a NOOP ALTER TABLE or adding a new column, as any such operation will end up creating a new table with a new underlying structure.

Coming back to the bug itself: it was fixed in MySQL 5.6.21, which resolved the excessive memory usage issue. However, the fix introduced a crashing bug, details of which are reported in Percona Server bug 1380010. This isn’t specifically a Percona Server bug, but rather a MySQL server bug. The crashing bug again affects MySQL upgrades badly. It, too, comes into play when you have a master and slave with table(s) that have different underlying structures, which is something you will likely see when upgrading a replication hierarchy, where for some time you will have a MySQL 5.5 master replicating to MySQL 5.6 slaves.

The good news is that Percona Server 5.6.21-70.1, released on 24 November 2014, fixes the crashing bug. So it’s safe again to upgrade to Percona Server 5.6.21-70.1 from an older MySQL or Percona Server version. I would love to see this fix in the upstream MySQL server as well.

Speedup Test Kitchen Vagrant Infrastructure Code Testing

Ovais Tariq — Sat, 15 Nov 2014 01:18:46 GMT

Test Kitchen together with Vagrant is a wonderful way to test out your infrastructure deployment and orchestration code. It makes test-driven development especially easy by allowing you to test locally using virtual machines. However, those of you who do a lot of testing with Test Kitchen and Vagrant will know that waiting for tests to complete can be painfully long. This is especially true when the same packages have to be downloaded over and over again with every test run, or when Vagrant decides to update the VirtualBox Guest Additions every time you run your test suites.

Christine Draper has an excellent post describing techniques that remove much of the waiting and make test runs far quicker. These basically involve installing two Vagrant plugins, vagrant-cachier and vagrant-omnibus, and disabling automatic VirtualBox Guest Additions updates.

I will not detail the why and how here, as the details are already present in Christine’s post which I have referenced. Instead I will provide a Vagrantfile template that can be used with Test Kitchen so that it uses the two plugins and disables VirtualBox Guest Additions updates.

Below is the content which you will need to write to a file named Vagrantfile.erb:

Vagrant.configure("2") do |c|
  c.vm.box = "<%= config[:box] %>"
  c.vm.box_url = "<%= config[:box_url] %>"

  if Vagrant.has_plugin?("vagrant-cachier")
    c.cache.auto_detect = true
    c.cache.scope = :box
  end

  if Vagrant.has_plugin?("vagrant-omnibus")
    c.omnibus.cache_packages = true
    c.omnibus.chef_version = "11.16.4"
  end

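  # Only touch the vbguest settings when the vagrant-vbguest plugin is installed
  if Vagrant.has_plugin?("vagrant-vbguest")
    c.vbguest.auto_update = false
  end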

<% if config[:vm_hostname] %>
  c.vm.hostname = "<%= config[:vm_hostname] %>"
<% end %>
<% if config[:guest] %>
  c.vm.guest = <%= config[:guest] %>
<% end %>
<% if config[:username] %>
  c.ssh.username = "<%= config[:username] %>"
<% end %>
<% if config[:ssh_key] %>
  c.ssh.private_key_path = "<%= config[:ssh_key] %>"
<% end %>

<% array(config[:network]).each do |opts| %>
  c.vm.network(:<%= opts[0] %>, <%= opts[1..-1].join(", ") %>)
<% end %>

  c.vm.synced_folder ".", "/vagrant", disabled: true
<% config[:synced_folders].each do |source, destination, options| %>
  c.vm.synced_folder "<%= source %>", "<%= destination %>", <%= options %>
<% end %>

  c.vm.provider :<%= config[:provider] %> do |p|
<% config[:customize].each do |key, value| %>
  <% case config[:provider] when "virtualbox" %>
    p.customize ["modifyvm", :id, "--<%= key %>", "<%= value %>"]
  <% when "rackspace", "softlayer" %>
    p.<%= key %> = "<%= value%>"
  <% when /^vmware_/ %>
    <% if key == :memory %>
      <% unless config[:customize].include?(:memsize) %>
    p.vmx["memsize"] = "<%= value %>"
      <% end %>
    <% else %>
    p.vmx["<%= key %>"] = "<%= value %>"
    <% end %>
  <% end %>
<% end %>
  end

end

The file Vagrantfile.erb can be in the same directory as the .kitchen.yml file and can be referenced in .kitchen.yml in the “platforms” section as follows:

platforms:
  - name: centos-6.4
    driver:
      vagrantfile_erb: Vagrantfile.erb

That’s about it. Next time you test your infrastructure code, don’t forget to use the Vagrantfile.erb template. You will be pleased to see how quickly the tests finish.

Percona XtraDB Cluster - A Drop-in-place Clustering Solution for MySQL

Ovais Tariq — Thu, 31 Oct 2013 14:06:28 GMT

Clustering solutions come up quite a lot when talking to customers about High Availability. The reason is that clustering is supposed to provide an easier way of maintaining high availability, without having to rely on other tools and techniques outside of the database server.

I thought it would be good to share the gist of many of my discussions around clustering, in the form of a blog post.

People usually tend to compare MySQL NDB Cluster and Percona XtraDB Cluster, but the two really are very different solutions.

For one, NDB Cluster means a complete rethink of how data is accessed by the application. You also have to deal with a storage engine that works and behaves differently from the InnoDB storage engine. The key point with NDB Cluster is data partitioning between different nodes. Not all applications are built with partitioning in mind. As such, you would have to adapt the application to make sure that it is aware of the partitioning and can distribute the workload effectively. This may mean partitioning in such a way that a single application request does not have to go to each partition to fetch data. I have typically seen a lot of effort put into applications when moving from a traditional InnoDB based solution to an NDB based solution. Moving to NDB Cluster is a major change and as such implies that you would probably need to rewrite parts of the application to make it work effectively.

To summarize: MySQL NDB Cluster is not a drop-in-place clustering solution.

However, Percona XtraDB Cluster, which is a Galera-based clustering solution, truly is a drop-in-place clustering solution. You might have to make minor changes in the application, but you can even get away with making none. If you are already using the InnoDB storage engine, then you do not need to make any changes to the application, except maybe for the fact that deadlocks happen a bit differently. Percona XtraDB Cluster provides a clustering layer over a traditional MySQL server with the InnoDB storage engine, such that everything is exactly the same except how replication is performed. XtraDB Cluster is a synchronous replication cluster, meaning that every transaction COMMIT implies that the transaction is replicated to every node in the cluster, where a process known as certification is performed to validate the transaction being committed and to perform conflict resolution. And you get other benefits with it:

High Availability: You can read/write to any node, when a node goes down, you can start reading/writing from a different node
Data Consistency: You do not have to worry about data consistency in the same way as you do with regular MySQL master-slave pair
Parallel Replication: You get true parallel replication without the restrictions that are present in MySQL 5.6
Partitioning Protection: Partitioning protection is built inside this clustering solution

What I really like about this solution, other than the fact that it is a true drop-in-place solution, is that you do not have to rely on other pieces of software to get high availability or partitioning protection. These things are already taken care of, which greatly reduces the complexity of this clustering solution. With a typical MySQL master-slave setup you have to rely on other solutions such as Pacemaker to get high availability and other related things. The other difference, of course, when compared to MySQL NDB Cluster is that data is not partitioned in any way: all the data is present on every node in an XtraDB Cluster.

To summarize: Percona XtraDB Cluster is a drop-in-place clustering solution.

I would suggest that you read more about Percona XtraDB Cluster here and give it a try the next time you want to move to a clustering solution.

InnoDB scalability issues due to tables without primary keys

Ovais Tariq — Fri, 18 Oct 2013 23:50:20 GMT

Work is probably being done every day to improve the performance of the InnoDB storage engine and to remove bottlenecks and scalability issues. There is one such issue that I wanted to highlight: scalability issues due to tables without primary keys. This issue typically shows itself as contention on the InnoDB dict_sys mutex, which controls access to the data dictionary. This mutex is used at various important places throughout the InnoDB code, and as such any contention on it is going to have a negative effect across the whole of InnoDB. You can read the rest of the post here.

Implications of Metadata Locking Changes in MySQL 5.5

Ovais Tariq — Sat, 09 Feb 2013 22:52:48 GMT

While most of the talk recently has been around the new changes in MySQL 5.6 (and that is understandable), I have lately had some very interesting cases to deal with regarding the Metadata Locking changes that were introduced in MySQL 5.5.3. It appears that the implications of Metadata Locking have not been covered well, and since there are still a large number of MySQL 5.0 and 5.1 installations that will upgrade or are in the process of upgrading to MySQL 5.5, I thought it necessary to discuss what exactly these implications are. You can read the rest of the post here.

On SSDs - Lifespans, Health Measurement and RAID

Ovais Tariq — Thu, 11 Oct 2012 17:54:57 GMT

Solid State Drives (SSDs) have made it big and have made their way not only into desktop computing but also into mission-critical servers. SSDs have proved to be a breakthrough in IO performance and leave HDDs far behind in terms of random IO performance. Random IO is what most database administrators are concerned about, as that is 90% of the IO pattern visible on database servers like MySQL. I have found the Intel 520-series and Intel 910-series to be quite popular, and they do give very good numbers in terms of random IOPS. However, it’s not just performance that you should be concerned about; failure prediction and health gauges are also very important, as loss of data is a big NO-NO. There is a great deal of misconception about the endurance level of SSDs, as they are mostly compared to rotating disks even when measuring endurance. However, there is a big difference in how SSDs and HDDs work, and that has a direct impact on the endurance level of an SSD.

I will mostly be talking about MLC SSDs. Now let’s start off with an SSD primer.

SSD Primer

The smallest unit of SSD storage that can be read or written to is a page, which is typically 4KB or 8KB in size. These pages are typically organized into blocks, which are between 256KB and 1MB in size. SSDs have no mechanical parts and no heads, and there are no seeks as with conventional rotating disks. Reads simply involve reading pages from the SSD; it’s the writes that are trickier. Once you have written to a page on an SSD, you cannot simply overwrite it with new data in the same way you do with an HDD. Instead, you must erase the contents and then write again. However, an SSD can only do erasures at the block level and not at the page level. What this means is that the SSD must relocate any valid data in the block to be erased before the block can be erased and have new data written to it. To summarize, writes mean erase+write. Nowadays, SSD controllers are intelligent and do erasures in the background, so that the latency of the write operation is not affected. These background erasures are typically done within a process known as garbage collection. You can imagine that if these erasures were not done in the background, writes would be too slow.

Of course every SSD has a lifespan after which it can be seen as unusable, let’s see what factors matter here.

SSD Lifespans

The lifespan of the blocks that make up an SSD is really the number of times erasures and writes can be performed on those blocks. The lifespan is measured in terms of erase/write cycles. Typically, enterprise grade MLC SSDs have a lifespan of about 30000 erase/write cycles, while consumer grade MLC SSDs have a lifespan of 5000 to 10000 erase/write cycles. This makes it clear that the lifespan of an SSD depends on how much it is written to. If you have a write-intensive workload, then you should expect the SSD to fail much more quickly than with a read-heavy workload. This is by design.
To offset the way writes reduce the life of an SSD, engineers use two techniques: wear-levelling and over-provisioning. Wear-levelling works by making sure that all the blocks in an SSD are erased and written to in an evenly distributed fashion, so that some blocks do not die more quickly than others. Over-provisioning SSD capacity is another technique that increases SSD endurance. This is accomplished by having a large population of blocks to distribute erases and writes over time (a bigger capacity SSD), and by providing a large spare area. Many SSD models over-provision the space; for example, an 80GB SSD could have 10GB of over-provisioned space, so that while it is actually 90GB in size it is reported as an 80GB SSD. While this over-provisioning is done by the SSD manufacturers, it can also be done by not utilising the entire SSD, for example by partitioning only about 75% to 80% of the SSD and leaving the rest as raw space that is not visible to the OS/filesystem. So while over-provisioning takes away some part of the disk capacity, it gives back in terms of increased endurance and performance.
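
The numbers in this section lend themselves to some back-of-the-envelope arithmetic. The sketch below is only a rough estimate under stated assumptions: perfect wear-levelling, a write amplification factor that you supply yourself, and hypothetical function names. Real drives will behave differently.

```python
def over_provisioning_ratio(raw_gb, usable_gb):
    """Spare capacity expressed as a fraction of the usable capacity."""
    return (raw_gb - usable_gb) / usable_gb


def rated_host_writes_gb(raw_gb, erase_write_cycles, write_amplification=1.0):
    """Rough upper bound on the data that can be written before wear-out.

    Assumes perfect wear-levelling, so every block reaches its rated cycle
    count at the same time. write_amplification is an assumed factor for
    the extra internal writes caused by relocating valid pages.
    """
    return raw_gb * erase_write_cycles / write_amplification


if __name__ == "__main__":
    # The 80GB drive with 10GB of spare area mentioned above.
    print(over_provisioning_ratio(90, 80))
    # Enterprise grade MLC at about 30000 cycles, with an assumed factor of 2.
    print(rated_host_writes_gb(90, 30000, write_amplification=2.0))
```

The same functions show why a write-heavy workload matters so much: doubling the write amplification halves the estimate.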

Now comes the important part of the post that I would like to discuss.

Health Measurement and failure predictability

As you may have noticed after reading the above part of this post, it’s all the more important to be able to predict when an SSD will fail and to be able to see health related information about the SSD. Yet I haven’t found much written about how to gauge the health of an SSD. RAID controllers employed with SSDs tend to be very limited in the amount of information they provide about an SSD that could allow predicting when it could fail. However, most SSDs provide a lot of information via S.M.A.R.T., and this can be leveraged to good effect.
Let’s consider the example of Intel SSDs. These SSDs have two S.M.A.R.T. attributes that can be leveraged to predict when the SSD will fail. These attributes are:

Available_Reservd_Space: This attribute reports the number of reserve blocks remaining. The value of the attribute starts at 100, which means that the reserved space is 100 percent available. The threshold value for this attribute is 10 which means 10 percent availability, which indicates that the drive is close to its end of life.
Media_Wearout_Indicator: This attribute reports the number of erase/write cycles the NAND media has performed. The value of the attribute decreases from 100 to 1, as the average erase cycle count increases from 0 to the maximum rated cycles. Once the value of this attribute reaches 1, the number will not decrease, although it is likely that significant additional wear can be put on the device. A value of 1 should be thought of as the threshold value for this attribute.

Using the smartctl tool (part of the smartmontools package) we can very easily read the values of these attributes and then use them to predict failures. For example, for SATA SSD drives attached to an LSI MegaRAID controller, we could read the values of those attributes using the following bash snippet:

Available_Reservd_Space_current=$(smartctl -d sat+megaraid,${device_id} -a /dev/sda | grep "Available_Reservd_Space" | awk '{print $4}')
Media_Wearout_Indicator_current=$(smartctl -d sat+megaraid,${device_id} -a /dev/sda | grep "Media_Wearout_Indicator" | awk '{print $4}')

The above information can then be used in different ways: we could raise an alert if a value is nearing its threshold, or measure how quickly the values decrease and then use the rate of decrease to estimate when the drive could fail.
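
As an illustration of the second approach, the following sketch extrapolates linearly from the first and last recorded readings of an attribute. It is a deliberately simple estimate and assumes a steady write rate; the function name and the sample data are hypothetical.

```python
def days_until_threshold(samples, threshold):
    """Estimate the days left before an attribute reaches its threshold.

    samples is a list of (day_number, attribute_value) pairs, oldest first,
    for example readings of Media_Wearout_Indicator collected via smartctl.
    Returns None when the value has not decreased, since no rate can be
    derived in that case.
    """
    first_day, first_value = samples[0]
    last_day, last_value = samples[-1]
    drop = first_value - last_value
    elapsed_days = last_day - first_day
    if drop <= 0 or elapsed_days <= 0:
        return None
    remaining = last_value - threshold
    # remaining / (drop / elapsed_days), rearranged to avoid rounding noise.
    return remaining * elapsed_days / drop


if __name__ == "__main__":
    # Hypothetical readings: the value went from 100 to 97 in 30 days.
    print(days_until_threshold([(0, 100), (30, 97)], threshold=1))
```

With a threshold of 10 the same helper could be applied to Available_Reservd_Space readings.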

SSDs and RAID levels

RAID has typically been used with HDDs for data protection via redundancy and for increased performance, and it has found its use with SSDs as well. It is common to see RAID level 5 or 6 being used with SSDs on mixed read/write workloads, because the write penalty seen with these levels on rotating disks is much smaller on SSDs: there is no disk seek involved, so the read-modify-write cycle typical of parity-based RAID levels does not cause much of a performance hit. On the other hand, striping and mirroring improve the read performance of SSDs a lot, and redundant arrays of SSDs deliver far better performance than HDD arrays.
But what about data protection? Do the parity-based RAID levels and mirroring provide the level of data protection for SSDs that they are thought to? I am skeptical, because as I have mentioned above, the endurance of an SSD depends a lot on how much it has been written to. In parity-based RAID configurations, a lot of extra writes are generated because of parity changes, and these of course decrease the lifespan of the SSDs. Similarly, I am not sure mirroring can provide any benefit against wear-out if both SSDs in the mirror are of the same age. Why? Because in a mirror both SSDs receive the same amount of writes, and hence their lifespans decrease at the same rate.
I think some drastic changes are needed in how we think about data protection and RAID levels, because to me neither parity-based nor mirrored configurations are going to provide any extra data protection against wear-out when the SSDs used are of similar ages. It might actually be a good idea to periodically replace drives with younger ones, to make sure that all the drives do not age together.
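
To make the argument concrete, here is a small back-of-the-envelope sketch. The endurance figure, the daily write volume and the function name are made up for illustration, and it assumes a steady write rate, which real workloads rarely have.

```python
def days_of_life_left(endurance_gb, written_gb, daily_writes_gb):
    """Days until a drive reaches its rated write endurance at a steady write rate."""
    return max(endurance_gb - written_gb, 0) / daily_writes_gb


ENDURANCE_GB = 100_000   # hypothetical rated endurance of each SSD
DAILY_WRITES_GB = 100    # a mirror sends the same writes to both members

# Two SSDs of the same age: both have absorbed the same writes so far.
same_age = [days_of_life_left(ENDURANCE_GB, 60_000, DAILY_WRITES_GB) for _ in range(2)]

# One member replaced by a fresh drive: the wear-out dates are now staggered.
staggered = [
    days_of_life_left(ENDURANCE_GB, 60_000, DAILY_WRITES_GB),
    days_of_life_left(ENDURANCE_GB, 0, DAILY_WRITES_GB),
]

print(same_age)
print(staggered)
```

With same-age drives both members of the mirror are expected to wear out on the same day, which is exactly the situation redundancy cannot help with. Swapping in a younger drive spreads the expected wear-out dates apart.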

I would like to know what my readers think!

Join Optimizations in MySQL 5.6 and MariaDB 5.5

Ovais Tariq — Thu, 31 May 2012 17:24:42 GMT

This is the third blog post in the series leading up to the talk comparing the optimizer enhancements in MySQL 5.6 and MariaDB 5.5. This post is targeted at the join-related optimizations introduced in the optimizer. These optimizations are available in both MySQL 5.6 and MariaDB 5.5, and MariaDB 5.5 has introduced some additional optimizations, which we will also look at in this post. You can read the rest of the post here.

Multi Range Read (MRR) in MySQL 5.6 and MariaDB 5.5

Ovais Tariq — Wed, 21 Mar 2012 20:30:45 GMT

I have written a second blog post in the series leading up to the talk comparing the optimizer enhancements in MySQL 5.6 and MariaDB 5.5. This blog post is aimed at the optimizer enhancement Multi Range Read (MRR). It is available in both MySQL 5.6 and MariaDB 5.5.

You can read the entire blog post here.

Index Condition Pushdown in MySQL 5.6 and MariaDB 5.5 and its performance impact

Ovais Tariq — Tue, 13 Mar 2012 13:37:59 GMT

I have been working with Peter in preparation for the talk comparing the optimizer enhancements in MySQL 5.6 and MariaDB 5.5. We are looking at and benchmarking the optimizer enhancements one by one. This blog post is aimed at a new optimizer enhancement, Index Condition Pushdown (ICP). It is available in both MySQL 5.6 and MariaDB 5.5. You can read more about this at MySQL Performance Blog here.

Profiling your slow queries using pt-query-digest and some love from Percona Server

Ovais Tariq — Wed, 28 Dec 2011 17:23:13 GMT

Overview

Profiling, analyzing and then fixing slow queries is likely the most oft-repeated part of a DBA’s job. There are not too many tools out there that make your life easier by analyzing queries with the kind of data points that allow you to attack the right queries in the right way. One such tool that I have always found myself using is pt-query-digest (formerly known as mk-query-digest).

Now let us go through using this very nice tool.

Before We Start!

But before we start, make sure you have enabled slow query logging and set a low enough long_query_time. The correct value of long_query_time depends on your application requirements; a long_query_time of 1 or 2 seconds might be sufficient for most users. It is also common to set long_query_time to 0 for a short period of time to log all queries.

Note that logging all queries in this fashion, as opposed to using the general query log, gives us statistics that are only available after the query has actually executed; no such statistics are available for queries logged using the general query log.

There might be other cases when you want to log queries taking less than 1 second. For that you can specify a fractional value (the resolution is microseconds); for example, long_query_time=0.5 logs queries taking longer than half a second.

Note that logging queries taking a fraction of a second is not possible for versions of MySQL < 5.1, unless you use the microslow patch developed by Percona. You can follow the guide here if you are still running MySQL < 5.1 and would like to install this patch. You should also note that for versions of MySQL < 5.1, setting long_query_time=0 would actually disable slow query logging.

Installing pt-query-digest (as well as the other tools from Percona Toolkit) is very easy, and is explained here at this link.

Using pt-query-digest

Using pt-query-digest is pretty straightforward:

pt-query-digest /path/to/slow-query.log

Note that running pt-query-digest can consume a lot of CPU and memory, so ideally you should copy the slow query log to another machine and run it there.

Analyzing pt-query-digest Output

Now let’s see what output it returns. The first part of the output is an overall summary:

# 250ms user time, 20ms system time, 17.38M rss, 53.62M vsz
# Current date: Wed Dec 28 08:16:13 2011
# Hostname: somehost.net
# Files: ./slow-query.log
# Overall: 296 total, 12 unique, 0.00 QPS, 0.00x concurrency _____________
# Time range: 2011-11-26 17:44:58 to 2011-12-27 13:01:44
# Attribute          total     min     max     avg     95%  stddev  median
# ============     ======= ======= ======= ======= ======= ======= =======
# Exec time           736s      1s     23s      2s      6s      2s      1s
# Lock time            5ms       0   290us    17us   103us    43us       0
# Rows sent          8.31M       0   8.31M  28.75k  202.40 477.30k       0
# Rows examine      53.69k       0   3.51k  185.75  964.41  497.06       0
# Rows affecte          38       0       1    0.13    0.99    0.33       0
# Rows read          8.31M       0   8.31M  28.75k  202.40 477.30k       0
# Bytes sent       294.94M       0 275.07M 1020.33k  79.83k  15.63M  79.83k
# Tmp tables            16       0       4    0.05       0    0.45       0
# Tmp disk tbl           0       0       0       0       0       0       0
# Tmp tbl size       1.94M       0 496.08k   6.70k       0  56.06k       0
# Query size        36.89k      44     886  127.63  420.77  144.69   69.19

It tells you that there are a total of 296 slow queries, which are actually invocations of 12 different queries. Following that are summaries of various other data points, such as the total and average query execution time. One thing I suggest here is to give more importance to the values reported in the 95% (95th percentile) column, as they give a more accurate picture than averages, which a few outliers can skew. For example, from this summary you can easily see whether you need to bump up the tmp_table_size variable: if you see a big “Tmp disk tbl” number, you can adjust tmp_table_size by looking at the 95% column of the “Tmp tbl size” row. Pretty nifty!
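
To see why the 95% column is often more telling than the average, consider the “Rows sent” row above, where the average is 28.75k but the 95th percentile is only 202.40. The sketch below reproduces that effect on made-up data using a simple nearest-rank percentile. It is only meant to illustrate the idea and is not how pt-query-digest itself computes its percentiles.

```python
import math


def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile using the simple nearest-rank method."""
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))   # 1-based rank into the sorted list
    return ordered[max(rank, 1) - 1]


# Made-up data: 295 queries sending 200 rows each, plus one runaway query.
rows_sent = [200] * 295 + [8_310_000]

average = sum(rows_sent) / len(rows_sent)
p95 = nearest_rank_percentile(rows_sent, 95)

print(f"avg={average:.0f} p95={p95}")
```

A single runaway query drags the average up by two orders of magnitude, while the 95th percentile still describes what a typical execution looks like.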

Let’s analyze the next part of the output produced by pt-query-digest.

# Profile
# Rank Query ID           Response time  Calls R/Call  Apdx V/M   Item
# ==== ================== ============== ===== ======= ==== ===== ========
#    1 0x92F3B1B361FB0E5B 644.9895 87.6%   244  2.6434 0.44  1.26 SELECT wp_options
#    2 0x555191621979A464  33.6349  4.6%    30  1.1212 0.65  0.03 REPLACE SELECT test.checksum test.sbtest_myisam
#    3 0x8354260420CBD34B  22.6124  3.1%     1 22.6124 0.00  0.00 SELECT customer address category

The above part of the output ranks the queries by total response time and shows the slowest ones at the top. As we can see, the slowest one is SELECT wp_options. This is basically a unique way of identifying the query, and simply implies that it is a SELECT query executed against the wp_options table.
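
The “Response time” percentage and the “R/Call” column in the profile can be cross-checked with simple arithmetic. The sketch below does that for the three rows shown, assuming the overall execution time is the rounded 736s reported in the summary. The helper name is mine, not something the tool exposes.

```python
# (rank, total response time in seconds, calls) copied from the profile above.
profile = [
    (1, 644.9895, 244),
    (2, 33.6349, 30),
    (3, 22.6124, 1),
]

# Assumption: the overall execution time is the rounded 736s from the summary.
OVERALL_EXEC_TIME = 736.0


def profile_row(total_time, calls, overall=OVERALL_EXEC_TIME):
    """Return (share of overall time in percent, average seconds per call)."""
    return round(100 * total_time / overall, 1), round(total_time / calls, 4)


for rank, total_time, calls in profile:
    share, per_call = profile_row(total_time, calls)
    print(rank, share, per_call)
```

This makes the difference between queries #1 and #3 obvious: #1 is moderately slow but runs 244 times, while #3 ran only once but took over 22 seconds on its own.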

Now let’s take a look at the most important part of the output:

# Query 1: 0.00 QPS, 0.00x concurrency, ID 0x92F3B1B361FB0E5B at byte 119442
# This item is included in the report because it matches --limit.
# Scores: Apdex = 0.44 [1.0], V/M = 1.26
# Query_time sparkline: |      ^ |
# Time range: 2011-12-08 17:48:20 to 2011-12-27 13:01:44
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count         82     244
# Exec time     87    645s      1s     10s      3s      6s      2s      1s
# Lock time     37     2ms       0   201us     8us    60us    30us       0
# Rows sent      0   3.40k       0     211   14.28  202.40   51.53       0
# Rows examine   7   4.01k       0     252   16.84  234.30   60.55       0
# Rows affecte   0       0       0       0       0       0       0       0
# Rows read      0   3.40k       0     211   14.28  202.40   51.53       0
# Bytes sent     6  19.03M  41.35k  83.95k  79.88k  79.83k   6.92k  79.83k
# Tmp tables     0       0       0       0       0       0       0       0
# Tmp disk tbl   0       0       0       0       0       0       0       0
# Tmp tbl size   0       0       0       0       0       0       0       0
# Query size    45  16.92k      71      71      71      71       0      71
# String:
# Databases    wp_blog_one (154/63%), wp_blog_tw... (81/33%)... 1 more
# Hosts
# InnoDB trxID 7F1910 (1/0%), 7F3860 (1/0%), 7F9F74 (1/0%)... 14 more
# Last errno   0
# Users        wp_blog_one (154/63%), wp_blog_two (81/33%)... 1 more
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms
#  10ms
# 100ms
#    1s  ################################################################
#  10s+
# Tables
#    SHOW TABLE STATUS FROM `wp_blog_two` LIKE 'wp_options'\G
#    SHOW CREATE TABLE `wp_blog_two`.`wp_options`\G
# EXPLAIN /*!50100 PARTITIONS*/
SELECT option_name, option_value FROM wp_options WHERE autoload = 'yes'\G

This part of the output deals with the analysis of the slowest query, ranked #1. The first row in the table above shows the Count, the number of times this query was executed. Now let’s take a look at the values in the 95% column: we can clearly see that this query is taking up a lot of execution time (6s) and is sending a lot of rows (202) and a lot of data (79.83k). The “Databases” section of the output shows the names of the databases where this query was executed. Next, the “Query_time distribution” section shows how the execution times of this query are distributed, which you can see lie in the range 1s to 10s all the time. The “Tables” section lists the queries that you can use to gather more data about the underlying tables involved and the query execution plan used by MySQL.
The end result might be that you end up limiting the number of results returned by the query, by using a LIMIT clause or by filtering based on the option_name column.
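
The “Query_time distribution” section is essentially a histogram with buckets that grow by powers of ten. The following sketch shows how an execution time maps to one of those buckets. The bucket boundaries are read off the report layout above, and the handling of times below 1us is my own assumption, so this approximates the display and is not the tool's actual code.

```python
# Lower bounds (in seconds) and labels of the buckets shown in the report.
BUCKETS = [
    (1e-6, "1us"), (1e-5, "10us"), (1e-4, "100us"), (1e-3, "1ms"),
    (1e-2, "10ms"), (1e-1, "100ms"), (1.0, "1s"), (10.0, "10s+"),
]


def bucket_label(seconds):
    """Return the label of the largest bucket whose lower bound is <= seconds."""
    label = BUCKETS[0][1]            # assumption: anything faster lands in "1us"
    for lower_bound, name in BUCKETS:
        if seconds >= lower_bound:
            label = name
    return label


print(bucket_label(2.6434))    # average time per call of query #1
print(bucket_label(22.6124))   # a single execution taking over 22 seconds
```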

Let’s analyze another slow query, this time query ranked #3 by pt-query-digest.

# Query 3: 0 QPS, 0x concurrency, ID 0x8354260420CBD34B at byte 132619 ___
# This item is included in the report because it matches --limit.
# Scores: Apdex = 0.00 [1.0]*, V/M = 0.00
# Query_time sparkline: |       ^|
# Time range: all events occurred at 2011-12-23 17:07:16
# Attribute    pct   total     min     max     avg     95%  stddev  median
# ============ === ======= ======= ======= ======= ======= ======= =======
# Count          0       1
# Exec time      3     23s     23s     23s     23s     23s       0     23s
# Lock time      1   101us   101us   101us   101us   101us       0   101us
# Rows sent     99   8.31M   8.31M   8.31M   8.31M   8.31M       0   8.31M
# Rows examine   0      24      24      24      24      24       0      24
# Rows affecte   0       0       0       0       0       0       0       0
# Rows read     99   8.31M   8.31M   8.31M   8.31M   8.31M       0   8.31M
# Bytes sent    93 275.07M 275.07M 275.07M 275.07M 275.07M       0 275.07M
# Tmp tables     0       0       0       0       0       0       0       0
# Tmp disk tbl   0       0       0       0       0       0       0       0
# Tmp tbl size   0       0       0       0       0       0       0       0
# Query size     0      92      92      92      92      92       0      92
# String:
# Databases    another_db
# Hosts
# InnoDB trxID 83E808
# Last errno   0
# Users        another_db
# Query_time distribution
#   1us
#  10us
# 100us
#   1ms
#  10ms
# 100ms
#    1s
#  10s+  ################################################################
# Tables
#    SHOW TABLE STATUS FROM `another_db` LIKE 'customer'\G
#    SHOW CREATE TABLE `another_db`.`customer`\G
#    SHOW TABLE STATUS FROM `another_db` LIKE 'address'\G
#    SHOW CREATE TABLE `another_db`.`address`\G
#    SHOW TABLE STATUS FROM `another_db` LIKE 'category'\G
#    SHOW CREATE TABLE `another_db`.`category`\G
# EXPLAIN /*!50100 PARTITIONS*/
select customer.store_id, address.address, category.name
  from customer, address, category\G

Let’s again take a look at the 95% column in the above output. You can clearly see that the problem with this query is that it is reading and sending far too many rows (8.31M) and far too much data (275M). Going down quickly to the “Tables” section and the query that follows it, you can see why that is happening: three tables are joined without a join condition, which means a cartesian product of all the rows in all three tables. See how easily you could pinpoint the cause of the slowness.
This query was actually executed on a demo of a MySQL web client running on one of the MySQL servers that I manage, which gives me a good reason to turn on the safe-updates option.
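
The effect of a missing join condition is easy to reproduce on a tiny scale. The sketch below uses an in-memory SQLite database (not MySQL) with made-up, minimal versions of the three tables. The column names and the address_id relationship are assumptions for illustration, since the real schema is not shown here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical, minimal schemas for illustration only.
cur.execute("CREATE TABLE customer (customer_id INTEGER, store_id INTEGER, address_id INTEGER)")
cur.execute("CREATE TABLE address (address_id INTEGER, address TEXT)")
cur.execute("CREATE TABLE category (category_id INTEGER, name TEXT)")

cur.executemany("INSERT INTO customer VALUES (?, ?, ?)", [(i, 1, i) for i in range(1, 5)])       # 4 rows
cur.executemany("INSERT INTO address VALUES (?, ?)", [(i, f"{i} Main St") for i in range(1, 6)])  # 5 rows
cur.executemany("INSERT INTO category VALUES (?, ?)", [(i, f"cat{i}") for i in range(1, 4)])      # 3 rows

# No join condition: every row is combined with every row of the other tables.
cross_rows = cur.execute("SELECT COUNT(*) FROM customer, address, category").fetchone()[0]

# With a join condition, only matching rows are combined.
joined_rows = cur.execute(
    "SELECT COUNT(*) FROM customer JOIN address ON address.address_id = customer.address_id"
).fetchone()[0]

print(cross_rows, joined_rows)
conn.close()
```

The row count of the unconstrained join is the product of the three table sizes, which is how three modest tables can end up sending millions of rows.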

Gather Even More Stats with some extra love from Percona Server!

My colleagues at Percona have built much more verbosity into what is written to the slow query log, which gives you even more insight into what is actually going on behind the scenes during the execution of a particular slow query. This is through the addition of a new variable, available in Percona Server 5.1 and above, called log_slow_verbosity. You can read more about this variable and some other diagnostics added to Percona Server from this link. The variable log_slow_verbosity can have the values ‘microtime’, ‘query_plan’, ‘innodb’ and ‘full’. Let’s turn on log_slow_verbosity as follows:

set session log_slow_verbosity='microtime,query_plan,innodb';

And see how verbose the new entry in slow log is:

# Time: 111228 11:52:30
# User@Host: root[root] @ localhost []
# Thread_id: 57  Schema: mywebsql_demo  Last_errno: 0  Killed: 0
# Query_time: 204.981516  Lock_time: 0.000133  Rows_sent: 10000  Rows_examined: 8721904  Rows_affected: 0  Rows_read: 10000
# Bytes_sent: 164952  Tmp_tables: 2  Tmp_disk_tables: 1  Tmp_table_sizes: 1995142008
# InnoDB_trx_id: 856C30
# QC_Hit: No  Full_scan: Yes  Full_join: Yes  Tmp_table: Yes  Tmp_table_on_disk: Yes
# Filesort: Yes  Filesort_on_disk: Yes  Merge_passes: 662
#   InnoDB_IO_r_ops: 1  InnoDB_IO_r_bytes: 16384  InnoDB_IO_r_wait: 0.034391
#   InnoDB_rec_lock_wait: 0.000000  InnoDB_queue_wait: 0.000000
#   InnoDB_pages_distinct: 11
SET timestamp=1325073150;
select customer.store_id, address.address, category.name   from customer, address, (select * from category) as category order by customer.store_id limit 10000;

See how much more information the slow query log has now: it reports everything from InnoDB stats to whether an on-disk filesort was needed for the query. You can clearly see that this is much more helpful than the regular slow query statistics reported by vanilla MySQL.
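
pt-query-digest already understands these extra attributes, but if you want to pick them out of a log entry yourself, they are laid out as simple “Name: value” pairs. The sketch below parses three of the lines from the entry above. The function name and the regular expression are my own and only cover lines in this shape, not the Time or User@Host lines.

```python
import re

# Attribute lines copied from the slow log entry above.
entry_lines = [
    "# Query_time: 204.981516  Lock_time: 0.000133  Rows_sent: 10000  "
    "Rows_examined: 8721904  Rows_affected: 0  Rows_read: 10000",
    "# QC_Hit: No  Full_scan: Yes  Full_join: Yes  Tmp_table: Yes  Tmp_table_on_disk: Yes",
    "# Filesort: Yes  Filesort_on_disk: Yes  Merge_passes: 662",
]


def parse_attributes(lines):
    """Collect every 'Name: value' pair found on the given comment lines."""
    attributes = {}
    for line in lines:
        for name, value in re.findall(r"(\w+): (\S+)", line):
            attributes[name] = value
    return attributes


attrs = parse_attributes(entry_lines)

# How many rows were examined for every row actually sent to the client.
examined_per_sent = int(attrs["Rows_examined"]) / int(attrs["Rows_sent"])

# Which operations spilled to disk for this query.
went_to_disk = [name for name in ("Tmp_table_on_disk", "Filesort_on_disk") if attrs[name] == "Yes"]

print(examined_per_sent, went_to_disk)
```

For this entry the query examined several hundred rows for each row it returned, and both the temporary table and the filesort went to disk, which goes a long way towards explaining the 204-second execution time.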

Conclusion

The only conclusion I can draw is: “Get yourself Percona Server, turn on log_slow_verbosity and start using pt-query-digest”. Your job of identifying bottleneck queries will be all the simpler then.