P.Linux Laboratory

ACMUG 2022-08-28 Shenzhen Salon

P.Linux — Sun, 07 May 2023 04:46:52 +0000

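This article is published under a CC license. You may repost it freely, provided you cite the original source, the author information, and this copyright notice with a hyperlink.
URL: http://www.penglixun.com/database/acmug-2022-08-28-shenzhen.html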

ACMUG 2023-05-06 HTAP Session (Beijing)

P.Linux — Sun, 07 May 2023 04:33:09 +0000

URL: http://www.penglixun.com/database/acmug-2023-05-06-htap-beijing.html

ACMUG 2022-08-24 Chengdu Salon

P.Linux — Thu, 04 May 2023 03:37:39 +0000

URL: http://www.penglixun.com/database/acmug-2022-08-24-chengdu.html

Sharing Some Recent Slide Decks

P.Linux — Tue, 23 Aug 2016 14:18:09 +0000

URL: http://www.penglixun.com/database/share_ppts_2016.html

I have not blogged in so long that this site is overgrown with weeds, so here are some of my latest slides.

Alibaba MySQL Kernel Monthly Report

P.Linux — Fri, 12 Jun 2015 05:20:05 +0000

URL: http://www.penglixun.com/database/alibaba_rds_mysql_kernel_monthly.html

Some people have asked why I have not written about MySQL for so long. It is not that I stopped writing: my research now goes into our Alibaba Cloud RDS MySQL Database Kernel Monthly Report. The links are below.

The older archives are here:
http://mysql.taobao.org/index.php?title=%E8%B5%84%E6%96%99%E5%85%B1%E4%BA%AB#MySQL_.E5.86.85.E6.A0.B8.E6.9C.88.E6.8A.A5

All new monthly reports can be found here:
http://mysql.taobao.org/monthly/

When I have time, I will also pick some of the most valuable articles from the monthly report and translate them into English for readers abroad. :-)

Some Git Tips

P.Linux — Fri, 12 Jun 2015 05:10:10 +0000

URL: http://www.penglixun.com/tech/program/some_git_skills.html

Over the past year our development moved from SVN/Bzr to Git. Overall Git works very well, so here is a summary of some useful commands.

git stash

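Sometimes you are halfway through some work on branch a when a bug is found on branch b that needs fixing right away.
What happens to the changes on branch a? Running git add is not enough: some versions of the git client refuse to switch branches because there are added but uncommitted files, and others carry the changes over to branch b.

git stash solves this problem. It saves both the working-tree changes and anything already staged with git add, then resets the working tree to HEAD (in effect a git reset --hard HEAD), so the tree is back at the last commit and clean. You can then safely switch to branch b and get on with the fix.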

git stash save "Save this for me, I need to fix a bug on another branch"
git stash list
git stash pop
git stash apply stash@{num}

git rebase

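Sometimes, while we are developing on branch a, master has already received many changes. If we then submit the changes from a, they may conflict with trunk, and the conflicts have to be resolved on trunk before the submission can go through, which is ugly.

This is where git rebase helps. Running git rebase BRANCH_NAME replays the current branch's commits on top of BRANCH_NAME, so the current branch then contains everything in BRANCH_NAME. What you develop on the current branch will then not conflict with BRANCH_NAME when it is submitted, because any conflicts are resolved on the current branch.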

git reset

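git reset undoes commits that have already been made; usually we just use git reset HEAD^. A branch may accumulate many commits during development, kept as checkpoints to go back to, but we require that each feature or bugfix enter trunk as a single commit. So you can first git reset back to the earliest commit, then package all of your changes into one final commit, and only then merge with trunk.

With these commands we can manage our MySQL development well. We have a single master branch as trunk, and nobody is allowed to develop on it directly. Each developer creates a branch for a feature or bug issue and works there. However many commits pile up during development, the final submission for each bugfix or feature must be one commit. So when development is finished, each developer does a git reset back to the earliest commit, saves the changes with git stash save, runs git checkout master; git pull to fetch the latest trunk, returns to their own branch, runs git rebase master to bring the branch up to trunk, and finally runs git stash pop to restore the changes. Any conflicts are resolved on the branch before git push.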

RDS High Availability: Implicit Primary Key

P.Linux — Tue, 22 Apr 2014 13:05:36 +0000

URL: http://www.penglixun.com/database/aliyun_rds_implicit_primary_key.html

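Originally published on the Alibaba Cloud product blog.

When building a stable and reliable application architecture, the database is one of the lowest-level components and has to be one of the most stable. In the cloud, RDS provides a service that is accessible 24/7 without interruption, with 99.95% availability.

RDS uses a master-slave replication architecture. For every instance a user purchases, RDS provides a slave with equivalent performance to guarantee high availability. The high-availability component (AURORA) checks the state of the master every 3 seconds, and when it finds that the master has gone down it quickly redirects the user's SQL requests to the slave.

Figure 1: RDS architecture diagram

Under this design, RDS must keep master and slave data consistent with a delay of no more than 10 seconds so that a failover can complete quickly. Otherwise RDS sacrifices availability to preserve consistency: it has to wait until the data is fully synchronized before switching. Replication delay therefore directly affects the availability of the service.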

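Replication reliability

The MySQL replication mode is configured with the BINLOG_FORMAT parameter. Before MySQL 5.1, MySQL replicated data in Statement mode, which can leave master and slave inconsistent, for example when functions such as UUID are used. Starting with 5.1, MySQL provides ROW-based replication, which greatly improves replication reliability. In the following scenarios, however, ROW mode makes the slave lag badly:

1) A table has no primary key, so the slave needs a full table scan to apply each event;
2) The master runs a DDL on a large table or a large transaction, and the slave needs just as long to execute it.

In production, RDS found that more than 99% of replication delays were caused by users not specifying a primary key when creating tables. RDS once tried a temporary workaround: switching the log format of lagging instances to MIXED, so that operations on tables without a primary key were logged in STATEMENT format. That approach can still produce inconsistent data between master and slave.

ROW-mode replication
ROW mode can guarantee reliable replication because it records each complete row in the BINLOG, including the values of all columns. When the slave applies the log, MySQL first tries to match its own record using the primary key in the row. If there is no primary key, it scans the whole table and compares every row against the log until it finds one that matches completely.

Figure 2: ROW-mode log matching flow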
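To make the cost difference concrete, here is a minimal Python sketch of the matching step described above. It is a toy model under stated assumptions, not MySQL code: the `Table` class, `find_row`, and the dict standing in for a primary-key index are all hypothetical names introduced for illustration.

```python
class Table:
    """A toy slave-side table: a list of row dicts plus an optional primary-key index."""

    def __init__(self, rows, primary_key=None):
        self.rows = rows
        self.primary_key = primary_key
        # With a primary key, rows can be located by key instead of by scanning.
        self.pk_index = (
            {row[primary_key]: pos for pos, row in enumerate(rows)}
            if primary_key is not None
            else None
        )


def find_row(table, before_image):
    """Return (position, rows_examined) for the row matching a ROW event's image."""
    if table.pk_index is not None:
        # Primary key available: a single keyed lookup.
        return table.pk_index.get(before_image[table.primary_key]), 1
    # No primary key: compare the complete image against every row until one matches.
    examined = 0
    for pos, row in enumerate(table.rows):
        examined += 1
        if row == before_image:
            return pos, examined
    return None, examined


rows = [{"id": i, "v": i * 10} for i in range(1000)]
target = {"id": 999, "v": 9990}
print(find_row(Table(rows, primary_key="id"), target))  # (999, 1)
print(find_row(Table(rows), target))                    # (999, 1000)
```

In this model, every event on the table without a primary key pays for a scan of up to the whole table, which is the source of the delay discussed here.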

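Design

The goal is to reduce replication delay while preserving replication reliability.

Option 1: remind users to add a primary key, and the problem disappears. In practice this is simply unrealistic: the learning cost for users, application compatibility, and implementation cost are far beyond what we imagined.

Option 2: this is a problem the cloud platform itself should solve, and users should not have to care about it. Let MySQL handle it intelligently underneath, transparently to the user, and everyone is happy.

For option 2 there are two approaches:

1. Why must ROW-format logs locate records by primary key? Would a secondary index do? It is not as precise as a primary key, but it at least avoids a full table scan.

2. The InnoDB engine also depends heavily on the primary key. For a table without one, it forcibly adds a primary key of its own that is hidden from the user. Could the MySQL Server layer do the same?

For approach 1, the main thing to consider is cost:

Figure 3: Using a secondary index to handle ROW-format events on tables without a primary key

If every row went through the optimizer, as when executing SQL, to decide which secondary index is best, it would be just as slow. The master builds an execution plan and picks an index only once per SQL statement, while the slave would have to regenerate a plan for every row that the statement affects.

Because a row in ROW format contains all columns, a more reasonable scheme is to choose a secondary index by a fixed rule; there will always be columns that can be used for filtering. For example, always use the first secondary index. No execution plan is needed, which saves a great deal of planning time. With such a rule in place, users can also reorder the secondary indexes to fit it, moving the most selective one into the position that gets used.

Fortunately, MariaDB has developed exactly such a patch, and it works quite well for tables that have secondary indexes but no primary key.

Approach 2 addresses the remaining cases: tables with no index at all, and tables whose secondary indexes all filter poorly (a gender column, for example). We considered referencing InnoDB's secondary indexes directly from the Server layer, but that would still do nothing for users of MyISAM tables, so a more general design was needed: MySQL automatically adds a primary key for the user, transparently to the application. This is the Implicit Primary Key. In the end we adopted the following design: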

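1 Turn on the RDS-specific parameter implict_primary_key to enable the implicit primary key feature;
2 When the user creates a table (CREATE TABLE), inspect the table definition:

2.1 If the table has a primary key, do nothing;
2.2 If the table has no primary key but has a unique key, put the unique key in the first index position so that the secondary-index patch can use it;
2.3 If the table has neither a primary key nor a unique key, create an auto-increment primary key with a specific reserved name for the user;

3 When the user alters the table (ALTER TABLE), inspect the new table definition:

3.1 If the user has added a primary key or unique key, drop the system-added primary key;
3.2 If the user has dropped the original primary key and unique keys, create an auto-increment primary key with the reserved name;

4 When the user runs DML, hide the existence of the implicit primary key:

4.1 INSERT INTO table VALUES (…): the user does not supply a value for the primary key in VALUES; the system fills in NULL automatically, so the auto-increment value is assigned when the row is written;
4.2 SELECT * FROM table: the implicit primary key column is filtered out before rows are returned to the user;
4.3 LOAD DATA INFILE: the user is unaware that the table has a primary key; the system fills in NULL automatically so that the auto-increment value is used;
4.4 SHOW CREATE TABLE, SHOW COLUMNS, and other SHOW statements: the implicit primary key column is filtered out when the definition is generated, so the user does not see a primary key column;

5 For system users (root) who need to see the real state, the show_ipk_info parameter is provided: after SET show_ipk_info=1 the implicit primary key is visible and nothing is hidden;
6 If implict_primary_key is turned off, the feature no longer takes effect: when the user runs a DDL on a table that has an implicit primary key, the key is dropped as part of that DDL. Implicit primary key columns that have not yet been dropped are still never shown to the user and stay hidden.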
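The CREATE TABLE rules (2.1 to 2.3) and the column hiding in rules 4 and 5 can be sketched in Python as follows. This is an illustrative model only, not the RDS patch: `IPK_COLUMN`, `on_create_table`, and `visible_columns` are hypothetical names, and the real reserved column name is not given in this post.

```python
IPK_COLUMN = "__implicit_pk__"  # hypothetical name; the post does not give the real one


def on_create_table(columns, primary_key=None, indexes=(), ipk_enabled=True):
    """Return (columns, primary_key, indexes) after applying rules 2.1 to 2.3."""
    columns, indexes = list(columns), list(indexes)
    if not ipk_enabled or primary_key:
        # Feature off, or rule 2.1: the table already has a primary key.
        return columns, primary_key, indexes
    unique = [ix for ix in indexes if ix["unique"]]
    if unique:
        # Rule 2.2: move the first unique key to the front of the index list.
        indexes.remove(unique[0])
        indexes.insert(0, unique[0])
        return columns, None, indexes
    # Rule 2.3: no primary key and no unique key, so add a hidden auto-increment key.
    return columns + [IPK_COLUMN], [IPK_COLUMN], indexes


def visible_columns(columns, show_ipk_info=False):
    """Rules 4 and 5: hide the implicit column unless show_ipk_info is set."""
    if show_ipk_info:
        return list(columns)
    return [c for c in columns if c != IPK_COLUMN]


cols, pk, idx = on_create_table(["a", "b"])
print(pk)                            # ['__implicit_pk__']
print(visible_columns(cols))         # ['a', 'b']
print(visible_columns(cols, True))   # ['a', 'b', '__implicit_pk__']
```

The ALTER TABLE rules (3.1 and 3.2) would follow the same shape: re-run the check against the new definition and add or drop the hidden column accordingly.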

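Figure 4: Implicit primary key in action

Based on approach 2, the RDS MySQL source team has finished developing a MySQL patch suited to the RDS scenario and has deployed it to all RDS MySQL 5.1 and MySQL 5.5 servers, where it is currently running stably.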

Migrating a Bzr Repository to Git

P.Linux — Fri, 18 Apr 2014 09:35:13 +0000

URL: http://www.penglixun.com/database/coverting_bzr_to_git.html

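I recently had two tasks in a row that involved moving Bzr code into Git, so here is a record of the steps.

Goal: import the GA source tree of MariaDB 10.0.10 into the company's Gitlab.

This requires bzr's fastimport plugin. The latest code is available from Launchpad (lp) and goes into bzr's plugins directory.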

➜ /Users/plx >cd ~/.bazaar/plugins
➜ /Users/plx/.bazaar/plugins >bzr branch lp:bzr-fastimport fastimport

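Running bzr selftest fastimport will then tell you which Python packages are missing; install them with easy_install.
Note in particular that the python-fastimport package must be a version below 0.92. Otherwise it is incompatible with bzr-fastimport, and two functions will be missing.

Next, find the revision number that corresponds to MariaDB 10.0.10 by querying the tags: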

➜ /Users/plx/Documents/Code/MariaDB/mariadb-10 >bzr tags | grep 10.0.10
mariadb-10.0.10      4140

Here we can see that the 10.0.10 tag corresponds to revision 4140, so we branch a copy of the code at revision 4140 to work on.

bzr branch -r4140 lp:maria

Once that is done, we can use the bzr-fastimport tool:

cd maria
git init
bzr fast-export --plain . | git fast-import

An empty project named m_10010 has already been created on Gitlab, so once the conversion is complete the result can be pushed there:

rm -rf bzr
rm -rf .bzr*
git remote add origin git@gitlab.alibaba-inc.com:m/m_10010.git
git push --mirror git@gitlab.alibaba-inc.com:m/m_10010.git

Done!

HAVING Without GROUP BY in MySQL

P.Linux — Thu, 08 Aug 2013 03:36:18 +0000

URL: http://www.penglixun.com/database/having_without_groupby_in_mysql.html

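Today a colleague reported something odd to me. On a table whose primary key is id, this query returns one row:

SELECT * FROM t HAVING id=MIN(id);

But simply changing MIN to MAX makes it return an empty set:

SELECT * FROM t HAVING id=MAX(id);

Why is that?

Let's first run an experiment to confirm the behavior.
Here is the table definition. We initialize it with two rows and then test: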

root@localhost : plx 10:25:10> show create table t2\G
*************************** 1. row ***************************
       Table: t2
Create Table: CREATE TABLE `t2` (
  `a` int(11) DEFAULT NULL,
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=utf8

root@localhost : plx 10:25:15> select * from t2;
+------+----+
| a    | id |
+------+----+
|    1 |  1 |
|    1 |  3 |
+------+----+
2 rows in set (0.00 sec)

root@localhost : plx 10:25:20> SELECT * FROM t2 HAVING id=MIN(id);
+------+----+
| a    | id |
+------+----+
|    1 |  1 |
+------+----+
1 row in set (0.00 sec)

root@localhost : plx 10:25:30> SELECT * FROM t2 HAVING id=MAX(id);
Empty set (0.00 sec)

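At first glance it really does behave that way. How can that be?

Let me try again: change column a to 10 in one of the rows, then run the same test on column a: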

root@localhost : plx 10:26:58> select * from t2;
+------+----+
| a    | id |
+------+----+
|   10 |  1 |
|    1 |  3 |
+------+----+
2 rows in set (0.00 sec)

root@localhost : plx 10:28:20> SELECT * FROM t2 HAVING a=MAX(a);
+------+----+
| a    | id |
+------+----+
|   10 |  1 |
+------+----+
1 row in set (0.00 sec)

root@localhost : plx 10:28:28> SELECT * FROM t2 HAVING a=MIN(a);
Empty set (0.00 sec)

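Whoa, this time MAX returns a row and MIN does not. Why?

Aside
HAVING is normally used together with GROUP BY, and a query like this one, with HAVING but no GROUP BY, is not standard usage.
MySQL, however, rewrites it by adding GROUP BY NULL: "SELECT * FROM t HAVING id=MIN(id)" is rewritten as "SELECT * FROM t GROUP BY NULL HAVING id=MIN(id)", which makes the syntax conform.

Back to the problem...
So what does this GROUP BY NULL produce? From reading the code and experimenting, GROUP BY NULL turns out to behave like LIMIT 1: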

root@localhost : plx 10:25:48> SELECT * FROM t2 GROUP BY NULL;
+------+----+
| a    | id |
+------+----+
|   10 |  1 |
+------+----+
1 row in set (0.00 sec)

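In other words, after GROUP BY NULL there is only one group, and what comes back is the first row.
But if that were the whole story, MIN and MAX would give the same result, and we would not see one query return a row while the other returns nothing. So let's run one more test.
With the data modified as above, look at the MIN/MAX values directly: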

root@localhost : plx 10:26:58> select * from t2;
+------+----+
| a    | id |
+------+----+
|   10 |  1 |
|    1 |  3 |
+------+----+
2 rows in set (0.00 sec)

root@localhost : plx 10:27:04> SELECT * FROM t2 GROUP BY NULL;
+------+----+
| a    | id |
+------+----+
|   10 |  1 |
+------+----+
1 row in set (0.00 sec)

root@localhost : plx 10:30:21> SELECT MAX(a),MIN(a),MAX(id),MIN(id) FROM t2 GROUP BY NULL;
+--------+--------+---------+---------+
| MAX(a) | MIN(a) | MAX(id) | MIN(id) |
+--------+--------+---------+---------+
|     10 |      1 |       3 |       1 |
+--------+--------+---------+---------+
1 row in set (0.00 sec)

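See the problem?
MAX/MIN are computed globally, not within the single-row result that LIMIT 1 would give.
So with GROUP BY NULL, MAX/MIN take the maximum and minimum over all of the data!

That means "SELECT * FROM t HAVING id=MIN(id)" is in essence "SELECT * FROM t HAVING id=1", which returns one row, while "SELECT * FROM t HAVING id=MAX(id)" is in essence "SELECT * FROM t HAVING id=3". The only row that survives the grouping is the first one (id=1), so of course nothing is returned. That is the root of the problem.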
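The behavior observed above can be modelled in a few lines of Python. This is a sketch of the semantics described in this post (the aggregate is computed over all rows, the HAVING test is applied to the first row only), not of MySQL's implementation; `having_without_group_by` is a hypothetical helper name.

```python
def having_without_group_by(rows, column, aggregate):
    """Model of SELECT * FROM t HAVING column = AGG(column) with no GROUP BY."""
    if not rows:
        return []
    # The aggregate is computed over every row in the table...
    agg_value = aggregate(row[column] for row in rows)
    # ...but only the first row is kept as the representative of the single group.
    representative = rows[0]
    return [representative] if representative[column] == agg_value else []


t2 = [{"a": 10, "id": 1}, {"a": 1, "id": 3}]

print(having_without_group_by(t2, "id", min))  # [{'a': 10, 'id': 1}]
print(having_without_group_by(t2, "id", max))  # []
print(having_without_group_by(t2, "a", max))   # [{'a': 10, 'id': 1}]
print(having_without_group_by(t2, "a", min))   # []
```

The four calls reproduce the four results from the session: a row comes back only when the first row happens to hold the global minimum or maximum.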

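Now test GROUP BY a. This behaves as expected: each group contains only one row, so MAX and MIN are equal, and this time they really are the maximum and minimum within each group.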

root@localhost : plx 11:29:49> SELECT MAX(a),MIN(a),MAX(id),MIN(id) FROM t2 GROUP BY a;
+--------+--------+---------+---------+
| MAX(a) | MIN(a) | MAX(id) | MIN(id) |
+--------+--------+---------+---------+
|      1 |      1 |       3 |       3 |
|     10 |     10 |       5 |       5 |
+--------+--------+---------+---------+
2 rows in set (0.00 sec)

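The behavior of MAX/MIN under GROUP BY NULL is the essence of this problem. So stick to standard syntax where you can, and before playing tricks with SQL, make sure it really behaves the way you think it does.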

Enjoy MySQL!

How mysqldump Works

P.Linux — Fri, 29 Mar 2013 06:19:45 +0000

URL: http://www.penglixun.com/database/the_process_of_mysqldump.html

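A few days ago a chat group was discussing mysqldump locking tables: why would a table that has already been dumped still be locked? To see what mysqldump actually does, let's look at the implementation in mysqldump.cc.

Important options

First, let's map the options to their internal variables and read their help text: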

--single-transaction: opt_single_transaction

Creates a consistent snapshot by dumping all tables in a single transaction. Works ONLY for tables stored in storage engines which support multiversioning (currently only InnoDB does); the dump is NOT guaranteed to be consistent for other storage engines. While a --single-transaction dump is in process, to ensure a valid dump file (correct table contents and binary log position), no other connection should use the following statements: ALTER TABLE, DROP TABLE, RENAME TABLE, TRUNCATE TABLE, as consistent snapshot is not isolated from them. Option automatically turns off --lock-tables.

--master-data: opt_master_data

This causes the binary log position and filename to be appended to the output. If equal to 1, will print it as a CHANGE MASTER command; if equal to 2, that command will be prefixed with a comment symbol. This option will turn --lock-all-tables on, unless --single-transaction is specified too (in which case a global read lock is only taken a short time at the beginning of the dump; don't forget to read about --single-transaction below). In all cases, any action on logs will happen at the exact moment of the dump. Option automatically turns --lock-tables off.

--lock-all-tables: opt_lock_all_tables

Locks all tables across all databases. This is achieved by taking a global read lock for the duration of the whole dump. Automatically turns --single-transaction and --lock-tables off.

--lock-tables: lock_tables

Lock all tables for read. (Defaults to on; use --skip-lock-tables to disable.)

--flush-logs: flush_logs

Flush logs file in server before starting dump. Note that if you dump many databases at once (using the option --databases= or --all-databases), the logs will be flushed for each database dumped. The exception is when using --lock-all-tables or --master-data: in this case the logs will be flushed only once, corresponding to the moment all tables are locked. So if you want your dump and the log flush to happen at the same exact moment you should use --lock-all-tables or --master-data with --flush-logs.

--delete-master-logs: opt_delete_master_logs

Delete logs on master after backup. This automatically enables --master-data.

--apply-slave-statements: opt_slave_apply (5.5)

Adds 'STOP SLAVE' prior to 'CHANGE MASTER' and 'START SLAVE' to bottom of dump.

Main code flow

Let's look at the 5.1 and 5.5 code in turn, both based on the latest trunk (5.1-rev.3909; 5.5-rev.4148).

Main flow in 5.1

We start with 5.1.

  if ((opt_lock_all_tables || opt_master_data) &&
      do_flush_tables_read_lock(mysql))
    goto err;

If --master-data or --lock-all-tables is set, mysqldump performs the FLUSH TABLES step.
Here is what do_flush_tables_read_lock() does:

do_flush_tables_read_lock()
  return
    ( mysql_query_with_error_report(mysql_con, 0,
                                    ((opt_master_data != 0) ? // if --master-data is set
                                        "FLUSH /*!40101 LOCAL */ TABLES" : // use FLUSH LOCAL TABLES
                                        "FLUSH TABLES")) || // otherwise use FLUSH TABLES
      mysql_query_with_error_report(mysql_con, 0,
                                    "FLUSH TABLES WITH READ LOCK") ); // runs only if the statement above succeeded

It runs FLUSH TABLES first and, if that succeeds, takes the global read lock with FLUSH TABLES WITH READ LOCK.
Further down comes the check for single-transaction:

  if (opt_single_transaction && start_transaction(mysql))
      goto err;

If --single-transaction is set, a transaction is opened to read the data.
Here is the implementation of start_transaction():

start_transaction()
  return (mysql_query_with_error_report(mysql_con, 0,
                                        "SET SESSION TRANSACTION ISOLATION "
                                        "LEVEL REPEATABLE READ") || // first set the session isolation level to RR
          mysql_query_with_error_report(mysql_con, 0,
                                        "START TRANSACTION "
                                        "/*!40100 WITH CONSISTENT SNAPSHOT */")); // then start the transaction in consistent snapshot mode (RR)

It first sets the isolation level to RR and then issues START TRANSACTION with the consistent snapshot hint.
Next comes fetching the master status:

  if (opt_master_data && do_show_master_status(mysql))
    goto err;

If --master-data is set, the current master status is printed.
After that, if --single-transaction is enabled, the global read lock can be released, because the transaction has already started.

  if (opt_single_transaction && do_unlock_tables(mysql)) /* unlock but no commit! */
    goto err;

do_unlock_tables() simply sends an UNLOCK TABLES statement to release the global read lock.

do_unlock_tables()
  return mysql_query_with_error_report(mysql_con, 0, "UNLOCK TABLES");

Then the dump_* functions are called to dump the whole instance, a single database, or a single table as required.

dump_all_databases()->dump_all_tables_in_db()
  if (lock_tables)
  {
    DYNAMIC_STRING query;
    init_dynamic_string_checked(&query, "LOCK TABLES ", 256, 1024);
    for (numrows= 0 ; (table= getTableName(1)) ; )
    {
      char *end= strmov(afterdot, table);
      if (include_table((uchar*) hash_key,end - hash_key))
      {
        numrows++;
        dynstr_append_checked(&query, quote_name(table, table_buff, 1));
        dynstr_append_checked(&query, " READ /*!32311 LOCAL */,");
      }
    }
    if (numrows && mysql_real_query(mysql, query.str, query.length-1))
      DB_error(mysql, "when using LOCK TABLES");
            /* We shall continue here, if --force was given */
    dynstr_free(&query);
  }
/* If --lock-tables is set (the default), LOCK TABLES tables_name READ is issued before dumping. */
...
  while ((table= getTableName(0)))
  {
    char *end= strmov(afterdot, table);
    if (include_table((uchar*) hash_key, end - hash_key))
    {
      dump_table(table,database); // dump one table
      my_free(order_by, MYF(MY_ALLOW_ZERO_PTR));
      order_by= 0;
      if (opt_dump_triggers && mysql_get_server_version(mysql) >= 50009)
      {
        if (dump_triggers_for_table(table, database)) // dump triggers
        {
          if (path)
            my_fclose(md_result_file, MYF(MY_WME));
          maybe_exit(EX_MYSQLERR);
        }
      }
    }
  }
/* dump_table dumps the table first; then, depending on whether --triggers is set, dump_triggers_for_table dumps its triggers. */
...
  if (lock_tables)
    VOID(mysql_query_with_error_report(mysql, 0, "UNLOCK TABLES"));
/* Once the dump is complete, release the table locks. */

So we now know that when dumping with --master-data and --single-transaction, --lock-tables is turned off automatically, and during the dump only the table currently being dumped holds an IS lock. Tables that are already finished or not yet started are not locked.
With the default setting, where --lock-tables is on, mysqldump first locks all the tables in the database being dumped, then dumps them, and finally releases all the locks at once.
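The 5.1 flow walked through above can be summarized as the ordered list of statements a dump sends. The following Python sketch is a simplified model built only from the code excerpts and help text quoted in this post; `mysqldump_51_statements` is a hypothetical name, the backtick quoting stands in for quote_name(), and the `-- dump <table>` entries are placeholders for the actual table dump.

```python
def mysqldump_51_statements(tables, single_transaction=False, master_data=0,
                            lock_all_tables=False, lock_tables=True):
    """Return the statements a 5.1-style dump of one database would send, in order."""
    # Option interactions, as stated in the help text.
    if lock_all_tables:
        single_transaction = False
        lock_tables = False
    if single_transaction or master_data:
        lock_tables = False

    stmts = []
    if lock_all_tables or master_data:            # do_flush_tables_read_lock()
        stmts.append("FLUSH /*!40101 LOCAL */ TABLES" if master_data else "FLUSH TABLES")
        stmts.append("FLUSH TABLES WITH READ LOCK")
    if single_transaction:                        # start_transaction()
        stmts.append("SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ")
        stmts.append("START TRANSACTION /*!40100 WITH CONSISTENT SNAPSHOT */")
    if master_data:                               # do_show_master_status()
        stmts.append("SHOW MASTER STATUS")
    if single_transaction:                        # do_unlock_tables(): unlock, no commit
        stmts.append("UNLOCK TABLES")
    if lock_tables and tables:                    # dump_all_tables_in_db()
        stmts.append("LOCK TABLES " + ",".join(
            f"`{t}` READ /*!32311 LOCAL */" for t in tables))
    stmts.extend(f"-- dump {t}" for t in tables)  # placeholder for dump_table()
    if lock_tables:
        stmts.append("UNLOCK TABLES")
    return stmts


for stmt in mysqldump_51_statements(["a", "b"], single_transaction=True, master_data=2):
    print(stmt)
```

With --single-transaction and --master-data the global read lock is released right after SHOW MASTER STATUS and no LOCK TABLES is sent, whereas the default options produce a single LOCK TABLES covering every table before the first one is dumped.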

Main flow in 5.5

Now let's compare and see what changed in the 5.5 mysqldump.

  if ((opt_lock_all_tables || opt_master_data ||
       (opt_single_transaction && flush_logs)) &&
      do_flush_tables_read_lock(mysql))
    goto err;

This differs from 5.1: a flush_logs check has been added. Plain --single-transaction on its own still does not call do_flush_tables_read_lock(); it is called only when --flush-logs is specified as well.
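The two conditions can be put side by side in a small Python sketch. It restates only the `if` conditions quoted in this post; `takes_global_read_lock` is a hypothetical helper name, and the version strings are just labels for the two excerpts.

```python
def takes_global_read_lock(version, lock_all_tables=False, master_data=0,
                           single_transaction=False, flush_logs=False):
    """Does mysqldump call do_flush_tables_read_lock() for these options?"""
    if version == "5.1":
        return bool(lock_all_tables or master_data)
    if version == "5.5":
        return bool(lock_all_tables or master_data
                    or (single_transaction and flush_logs))
    raise ValueError("only the 5.1 and 5.5 conditions are modelled")


# --single-transaction alone: no global read lock in either version
print(takes_global_read_lock("5.1", single_transaction=True))                   # False
print(takes_global_read_lock("5.5", single_transaction=True))                   # False
# --single-transaction --flush-logs: only 5.5 takes the lock
print(takes_global_read_lock("5.1", single_transaction=True, flush_logs=True))  # False
print(takes_global_read_lock("5.5", single_transaction=True, flush_logs=True))  # True
```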

  /*
    Flush logs before starting transaction since
    this causes implicit commit starting mysql-5.5.
  */
  if (opt_lock_all_tables || opt_master_data ||
      (opt_single_transaction && flush_logs) ||
      opt_delete_master_logs)
  {
    if (flush_logs || opt_delete_master_logs)
    {
      if (mysql_refresh(mysql, REFRESH_LOG))
        goto err;
      verbose_msg("-- main : logs flushed successfully!\n");
    }

    /* Not anymore! That would not be sensible. */
    flush_logs= 0;
  }

In 5.5, mysqldump attempts FLUSH LOGS at this point.

  if (opt_delete_master_logs)
  {
    if (get_bin_log_name(mysql, bin_log_name, sizeof(bin_log_name)))
      goto err;
  }

This handling is new in 5.5. It serves --delete-master-logs, which deletes logs on the master; here the binlog file name is fetched first.

  if (opt_single_transaction && start_transaction(mysql))
    goto err;

This part is unchanged.

  /* Add 'STOP SLAVE to beginning of dump */
  if (opt_slave_apply && add_stop_slave())
    goto err;
  if (opt_master_data && do_show_master_status(mysql))
    goto err;
  if (opt_slave_data && do_show_slave_status(mysql))
    goto err;
  if (opt_single_transaction && do_unlock_tables(mysql)) /* unlock but no commit! */
    goto err;

Here there are new parts for opt_slave_apply and opt_slave_data, which add the STOP SLAVE statement and output the result of SHOW SLAVE STATUS.
After that, the dump_* functions are called to dump the data, as before.
However, because 5.5 has MDL (metadata locks), with --single-transaction every table touched inside the transaction holds an MDL and therefore cannot be broken by DDL.
For example, suppose mysqldump has already backed up tables a, b, and c. They are inside the transaction, the transaction has not committed, and their MDLs are not released. If another thread runs a DDL on any of a, b, or c, it will show Waiting for table metadata lock. Tables that have not been backed up yet hold no MDL, so DDL on them is still possible.

Setting the Initial InnoDB Datafile Size When Creating a Table

P.Linux — Mon, 03 Dec 2012 04:34:23 +0000

URL: http://www.penglixun.com/database/setting_innodb_table_initial_size.html

Under a write-intensive workload, InnoDB datafiles grow quickly as the B-Tree allocates new pages. Extending a datafile requires a mutex that protects the file, and this causes performance jitter. Xiaobin Lin (Ding Qi) describes the problem on his blog: http://dinglin.iteye.com/blog/1317874

The fix is simple: if we know roughly how large the datafile may grow, we can extend it in advance. Reading the source shows that the size InnoDB gives a new table's datafile is controlled by the constant FIL_IBD_FILE_INITIAL_SIZE, and that the datafile is initialized by the function fil_create_new_single_table_tablespace(). So to change the initial datafile size, we only need to change the size argument passed to fil_create_new_single_table_tablespace(), which defaults to FIL_IBD_FILE_INITIAL_SIZE.

So I added a new option to the CREATE TABLE syntax, named datafile_initial_size. For example:
CREATE TABLE test (
…
) ENGINE = InnoDB DATAFILE_INITIAL_SIZE=100000;
If the DATAFILE_INITIAL_SIZE value is smaller than FIL_IBD_FILE_INITIAL_SIZE, FIL_IBD_FILE_INITIAL_SIZE is still passed to fil_create_new_single_table_tablespace(); otherwise the DATAFILE_INITIAL_SIZE value is passed for initialization.
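The size rule is a simple clamp, which the following Python sketch illustrates together with the UINT_MAX32 cap that the patch's parser rule applies. This is a model, not InnoDB code: `effective_initial_size` is a hypothetical name, and the value 4 for FIL_IBD_FILE_INITIAL_SIZE is an assumption about the InnoDB source of that era.

```python
FIL_IBD_FILE_INITIAL_SIZE = 4   # assumed InnoDB default; check your source tree
UINT_MAX32 = 0xFFFFFFFF         # largest value the uint field in the patch can hold


def effective_initial_size(requested):
    """Size passed to fil_create_new_single_table_tablespace() for a request."""
    parsed = min(requested, UINT_MAX32)            # parser: cap at UINT_MAX32
    return max(parsed, FIL_IBD_FILE_INITIAL_SIZE)  # never go below the default


print(effective_initial_size(100000))   # 100000
print(effective_initial_size(1))        # 4
print(effective_initial_size(2 ** 40))  # 4294967295
```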

The result is this simple, safe patch. See http://bugs.mysql.com/bug.php?id=67792 to follow upstream progress:

Index: storage/innobase/dict/dict0crea.c
===================================================================
--- storage/innobase/dict/dict0crea.c	(revision 3063)
+++ storage/innobase/dict/dict0crea.c	(working copy)
@@ -294,7 +294,8 @@
 		error = fil_create_new_single_table_tablespace(
 			space, path_or_name, is_path,
 			flags == DICT_TF_COMPACT ? 0 : flags,
-			FIL_IBD_FILE_INITIAL_SIZE);
+			table->datafile_initial_size < FIL_IBD_FILE_INITIAL_SIZE ? 
+        FIL_IBD_FILE_INITIAL_SIZE : table->datafile_initial_size);
 		table->space = (unsigned int) space;
 
 		if (error != DB_SUCCESS) {
Index: storage/innobase/handler/ha_innodb.cc
===================================================================
--- storage/innobase/handler/ha_innodb.cc	(revision 3063)
+++ storage/innobase/handler/ha_innodb.cc	(working copy)
@@ -7155,6 +7155,7 @@
 			col_len);
 	}
 
+  table->datafile_initial_size= form->datafile_initial_size;
 	error = row_create_table_for_mysql(table, trx);
 
 	if (error == DB_DUPLICATE_KEY) {
@@ -7760,6 +7761,7 @@
 
 	row_mysql_lock_data_dictionary(trx);
 
+  form->datafile_initial_size= create_info->datafile_initial_size;
 	error = create_table_def(trx, form, norm_name,
 		create_info->options & HA_LEX_CREATE_TMP_TABLE ? name2 : NULL,
 		flags);
Index: storage/innobase/include/dict0mem.h
===================================================================
--- storage/innobase/include/dict0mem.h	(revision 3063)
+++ storage/innobase/include/dict0mem.h	(working copy)
@@ -678,6 +678,7 @@
 /** Value of dict_table_struct::magic_n */
 # define DICT_TABLE_MAGIC_N	76333786
 #endif /* UNIV_DEBUG */
+  uint datafile_initial_size; /* the initial size of the datafile */
 };
 
 #ifndef UNIV_NONINL
Index: support-files/mysql.5.5.18.spec
===================================================================
--- support-files/mysql.5.5.18.spec	(revision 3063)
+++ support-files/mysql.5.5.18.spec	(working copy)
@@ -244,7 +244,7 @@
 Version:        5.5.18
 Release:        %{release}%{?distro_releasetag:.%{distro_releasetag}}
 Distribution:   %{distro_description}
-License:        Copyright (c) 2000, 2011, %{mysql_vendor}. All rights reserved. Under %{license_type} license as shown in the Description field.
+License:        Copyright (c) 2000, 2012, %{mysql_vendor}. All rights reserved. Under %{license_type} license as shown in the Description field.
 Source:         http://www.mysql.com/Downloads/MySQL-5.5/%{src_dir}.tar.gz
 URL:            http://www.mysql.com/
 Packager:       MySQL Release Engineering 
Index: sql/table.h
===================================================================
--- sql/table.h	(revision 3063)
+++ sql/table.h	(working copy)
@@ -596,6 +596,7 @@
   */
   key_map keys_in_use;
   key_map keys_for_keyread;
+  uint datafile_initial_size; /* the initial size of the datafile */
   ha_rows min_rows, max_rows;		/* create information */
   ulong   avg_row_length;		/* create information */
   ulong   version, mysql_version;
@@ -1094,6 +1095,8 @@
 #endif
   MDL_ticket *mdl_ticket;
 
+  uint datafile_initial_size;
+
   void init(THD *thd, TABLE_LIST *tl);
   bool fill_item_list(List<Item> *item_list) const;
   void reset_item_list(List<Item> *item_list) const;
Index: sql/sql_yacc.yy
===================================================================
--- sql/sql_yacc.yy	(revision 3063)
+++ sql/sql_yacc.yy	(working copy)
@@ -906,6 +906,7 @@
 %token  DATABASE
 %token  DATABASES
 %token  DATAFILE_SYM
+%token  DATAFILE_INITIAL_SIZE_SYM
 %token  DATA_SYM                      /* SQL-2003-N */
 %token  DATETIME
 %token  DATE_ADD_INTERVAL             /* MYSQL-FUNC */
@@ -5046,6 +5047,18 @@
             Lex->create_info.db_type= $3;
             Lex->create_info.used_fields|= HA_CREATE_USED_ENGINE;
           }
+        | DATAFILE_INITIAL_SIZE_SYM opt_equal ulonglong_num
+          {
+            if ($3 > UINT_MAX32)
+            {
+              Lex->create_info.datafile_initial_size= UINT_MAX32;
+            }
+            else
+            {
+              Lex->create_info.datafile_initial_size= $3;
+            }
+            Lex->create_info.used_fields|= HA_CREATE_USED_DATAFILE_INITIAL_SIZE;
+          }
         | MAX_ROWS opt_equal ulonglong_num
           {
             Lex->create_info.max_rows= $3;
@@ -12585,6 +12598,7 @@
         | CURSOR_NAME_SYM          {}
         | DATA_SYM                 {}
         | DATAFILE_SYM             {}
+        | DATAFILE_INITIAL_SIZE_SYM{}
         | DATETIME                 {}
         | DATE_SYM                 {}
         | DAY_SYM                  {}
Index: sql/handler.h
===================================================================
--- sql/handler.h	(revision 3063)
+++ sql/handler.h	(working copy)
@@ -387,6 +387,8 @@
 #define HA_CREATE_USED_TRANSACTIONAL    (1L << 20)
 /** Unused. Reserved for future versions. */
 #define HA_CREATE_USED_PAGE_CHECKSUM    (1L << 21)
+/** Used for InnoDB initial table size. */
+#define HA_CREATE_USED_DATAFILE_INITIAL_SIZE (1L << 22)
 
 typedef ulonglong my_xid; // this line is the same as in log_event.h
 #define MYSQL_XID_PREFIX "MySQLXid"
@@ -1053,6 +1055,7 @@
   LEX_STRING comment;
   const char *data_file_name, *index_file_name;
   const char *alias;
+  uint datafile_initial_size; /* the initial size of the datafile */
   ulonglong max_rows,min_rows;
   ulonglong auto_increment_value;
   ulong table_options;
Index: sql/lex.h
===================================================================
--- sql/lex.h	(revision 3063)
+++ sql/lex.h	(working copy)
@@ -153,6 +153,7 @@
   { "DATABASE",		SYM(DATABASE)},
   { "DATABASES",	SYM(DATABASES)},
   { "DATAFILE", 	SYM(DATAFILE_SYM)},
+  { "DATAFILE_INITIAL_SIZE",   SYM(DATAFILE_INITIAL_SIZE_SYM)},
   { "DATE",		SYM(DATE_SYM)},
   { "DATETIME",		SYM(DATETIME)},
   { "DAY",		SYM(DAY_SYM)},

MariaDB 10.x Will Include Multi-Source Replication

P.Linux — Wed, 17 Oct 2012 09:08:09 +0000

URL: http://www.penglixun.com/database/multi_source_replication_for_mariadb.html

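During the National Day holiday I worked with Monty to merge the multi-source replication feature I developed into the MariaDB trunk. It will appear in the 10.x releases.

Monty wrote a blog post specifically to introduce the multi-source replication patch: http://monty-says.blogspot.com/2012/10/multi-source-replication-for-mariadb-is.html

MariaDB 10.x has not been officially released yet, but you can already download the latest source tree and build it: https://code.launchpad.net/~maria-captains/maria/10.0-base

The one known issue so far is that semi-synchronous replication (Semi-sync) stops working once multi-source replication is in use, and fixing that will probably take a while. If you do not use semi-sync and urgently need multi-source replication, you can use the code in the source tree directly; there is no longer any need to apply my patch to MySQL and rebuild. Multi-source replication is generally used to aggregate data for analysis, and MariaDB's optimizer, which in my view is the strongest among the MySQL branches, is well suited to that kind of OLAP work.

The usage documentation is here: https://kb.askmonty.org/en/multi-source-replication/

It is worth mentioning that this merge adds SHOW ALL SLAVES STATUS, which shows the replication state of every channel, and START/STOP ALL SLAVES, which start or stop all channels at once. The long-standing problem that errors on a specific channel could not be skipped is fixed along the way: with the new variable you select a channel using set @@default_master_connection='connection_name', after which the single-channel Sql_slave_skip_counter works as usual.

Thanks also go to Monty for reviewing my patch, finding so many hidden problems, and giving me commit access. I hope to do more for open source and make more improvements to MySQL.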

SVN: Merging a Branch into Trunk

P.Linux — Fri, 21 Sep 2012 06:18:18 +0000

URL: http://www.penglixun.com/tech/program/svn_merge_branch_trunk.html

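The original article is here; this is only my translation: http://www.sepcot.com/blog/2007/04/SVN-Merge-Branch-Trunk

This post was written as a note to myself, but putting it out there may prove useful to others as well.

Recently I was given more responsibilities at work, including branch control for part of our website. It took me a while to work out how to handle everything, and most of the material I found online was not much help, so I am writing it up here.

We use SVN for source control, and the code lives on a server that is accessible over SSH.

Merging a branch into trunk

Check out a copy of trunk:

svn co svn+ssh://server/path/to/trunk

Check out a copy of the branch you want to merge:

svn co svn+ssh://server/path/to/branch/myBranch

Change your current working directory to "myBranch" and find the revision at which "myBranch" began:

svn log --stop-on-copy

This shows the point at which your branch was cut from trunk. Remember that number (it looks like rXXXX, where XXXX is the revision number).
Change your current working directory to trunk and run an SVN update:

svn up

This updates your copy of trunk to the latest revision and tells you what that revision is. Note this number too (the message should read "At revision YYYY", where YYYY is the second number you need to remember).
Now we can perform the SVN merge:

svn merge -rXXXX:YYYY svn+ssh://server/path/to/branch/myBranch

This applies all of the changes from your branch to trunk.
Resolve any conflicts that arise during the merge.
Commit the results:

svn ci -m "MERGE myProject myBranch [XXXX]:[YYYY] into trunk"

That is it. You have now merged "myBranch" with trunk.

Update

Steps 2 to 4 (checking out the branch, switching into it, and running svn log there) can be replaced with a single command against the repository URL:

svn log --stop-on-copy svn+ssh://server/path/to/branch

Bonus

Cutting a branch is much simpler than merging one. Here is how.
Perform an SVN copy:

svn copy svn+ssh://server/path/to/trunk svn+ssh://server/path/to/branch/newBranch -m "Cut branch: newBranch"

That is everything. I hope it helps.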

Does InnoDB Always Append the Primary Key to an Index?

P.Linux — Thu, 20 Sep 2012 05:59:33 +0000

URL: http://www.penglixun.com/database/will_innodb_store_pk_in_index.html

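The DBA chat group was discussing a question: does InnoDB append the primary key to the end of an index, and if so, when?

From reading the code earlier, I remembered it this way: if the index already ends with the primary key, InnoDB does not add it again; if it does not, InnoDB appends the primary key. But that does not match the following test: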

CREATE TABLE t (
  a char(32) not null primary key,
  b char(32) not null,
  KEY idx1 (a,b),
  KEY idx2 (b,a)
) Engine=InnoDB;

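After inserting some data, the two indexes idx1 and idx2 turn out to be the same size. That means idx1 and idx2 have the same internal structure, so idx1 cannot be stored internally as (a,b,a).

Under Dengbo's guidance I read the function dict0dict.cc:dict_index_build_internal_non_clust(), which builds the data dictionary representation of an index. Once you understand this process, the answer is clear. Let's walk through the function (based on a recent 5.6 trunk):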

/*******************************************************************//**
Builds the internal dictionary cache representation for a non-clustered
index, containing also system fields not defined by the user.
@return own: the internal representation of the non-clustered index */
static
dict_index_t*
dict_index_build_internal_non_clust(
/*================================*/
  const dict_table_t* table,  /*!< in: table */
  dict_index_t*   index)  /*!< in: user representation of
          a non-clustered index */
{
  dict_field_t* field;
  dict_index_t* new_index;
  dict_index_t* clust_index;
  ulint   i;
  ibool*    indexed;

  ut_ad(table && index);
  ut_ad(!dict_index_is_clust(index));
  ut_ad(mutex_own(&(dict_sys->mutex)));
  ut_ad(table->magic_n == DICT_TABLE_MAGIC_N);

  /* The clustered index should be the first in the list of indexes */
  clust_index = UT_LIST_GET_FIRST(table->indexes);

  ut_ad(clust_index);
  ut_ad(dict_index_is_clust(clust_index));
  ut_ad(!dict_index_is_univ(clust_index));

  /* Create a new index */
  new_index = dict_mem_index_create(
    table->name, index->name, index->space, index->type,
    index->n_fields + 1 + clust_index->n_uniq);

  /* Copy other relevant data from the old index
  struct to the new struct: it inherits the values */

  new_index->n_user_defined_cols = index->n_fields;

  new_index->id = index->id;

  /* Copy fields from index to new_index */
  dict_index_copy(new_index, index, table, 0, index->n_fields);

  /* Remember the table columns already contained in new_index */
  indexed = static_cast<ibool*>(
    mem_zalloc(table->n_cols * sizeof *indexed));

  /* Mark the table columns already contained in new_index */
  for (i = 0; i < new_index->n_def; i++) {

    field = dict_index_get_nth_field(new_index, i);

    /* If there is only a prefix of the column in the index
    field, do not mark the column as contained in the index */

    if (field->prefix_len == 0) {

      indexed[field->col->ind] = TRUE;
    }
  }

  /* Add to new_index the columns necessary to determine the clustered
  index entry uniquely */

  for (i = 0; i < clust_index->n_uniq; i++) {

    field = dict_index_get_nth_field(clust_index, i);

    if (!indexed[field->col->ind]) {
      dict_index_add_col(new_index, table, field->col,
             field->prefix_len);
    }
  }

  mem_free(indexed);

  if (dict_index_is_unique(index)) {
    new_index->n_uniq = index->n_fields;
  } else {
    new_index->n_uniq = new_index->n_def;
  }

  /* Set the n_fields value in new_index to the actual defined
  number of fields */

  new_index->n_fields = new_index->n_def;

  new_index->cached = TRUE;

  return(new_index);
}

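That is the whole function. It is best to read it through yourself first and then come back to the analysis.

Now for the analysis. First, the relevant members of the dict_index_t struct: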

  unsigned  n_user_defined_cols:10;
        /*!< number of columns the user defined to
        be in the index: in the internal
        representation we add more columns */
  unsigned  n_uniq:10;/*!< number of fields from the beginning
        which are enough to determine an index
        entry uniquely */
  unsigned  n_def:10;/*!< number of fields defined so far */
  unsigned  n_fields:10;/*!< number of fields in the index */

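The comments are mostly self-explanatory. n_uniq is the number of leading fields needed to identify an index entry uniquely: as the function above shows, for a unique index it is the number of user-defined fields, otherwise it covers every field, including the primary key columns InnoDB appended. n_def is the number of fields defined so far while the index is being built. n_fields is the final total number of fields in the index, including those added by the system. n_user_defined_cols is the number of fields the user defined, excluding the ones added automatically.

Now let's look at the two most important pieces of code: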

  /* Remember the table columns already contained in new_index */
  indexed = static_cast<ibool*>(
    mem_zalloc(table->n_cols * sizeof *indexed));

  /* Mark the table columns already contained in new_index */
  for (i = 0; i < new_index->n_def; i++) {

    field = dict_index_get_nth_field(new_index, i);

    /* If there is only a prefix of the column in the index
    field, do not mark the column as contained in the index */

    if (field->prefix_len == 0) {

      indexed[field->col->ind] = TRUE;
    }
  }

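InnoDB first allocates a boolean array with one slot per table column, then loops over every field of the index. If a field holds the whole column rather than just a prefix, its slot is set, marking the column as present in the index. The indexed array therefore records every column that the user-defined index contains in full.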

  /* Add to new_index the columns necessary to determine the clustered
  index entry uniquely */

  for (i = 0; i < clust_index->n_uniq; i++) {

    field = dict_index_get_nth_field(clust_index, i);

    if (!indexed[field->col->ind]) {
      dict_index_add_col(new_index, table, field->col,
             field->prefix_len);
    }
  }

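This part loops over every field of the clustered index (the primary key) and checks the indexed array to see whether the column is already present. If it is not, dict_index_add_col is called to append the column to the index.

So any primary key column that the user-defined index already contains in full is not added again. If the user's index does not contain all primary key columns, InnoDB appends the missing ones to the end of the index.

That is why, in the original example, idx1 and idx2 have exactly the same internal size: both hold just the columns a and b.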
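The column-selection logic analysed above can be sketched as a small standalone C function. This is only an illustration under simplifying assumptions, not InnoDB code: build_internal_index, MAX_COLS and the use of plain column numbers are made up for the example, and prefix columns are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_COLS 16 /* assumed upper bound on table columns for this sketch */

/* Hypothetical helper, not part of InnoDB: writes the internal column list of
   a secondary index to out and returns its length. user_cols holds the column
   numbers of the user-defined index, pk_cols those of the primary key. */
size_t build_internal_index(const int *user_cols, size_t n_user,
                            const int *pk_cols, size_t n_pk, int *out)
{
    bool indexed[MAX_COLS] = { false };
    size_t n_def = 0;

    /* Copy the user-defined columns and remember which ones are present. */
    for (size_t i = 0; i < n_user; i++) {
        assert(user_cols[i] >= 0 && user_cols[i] < MAX_COLS);
        out[n_def++] = user_cols[i];
        indexed[user_cols[i]] = true;
    }

    /* Append only the primary key columns that are still missing. */
    for (size_t i = 0; i < n_pk; i++) {
        assert(pk_cols[i] >= 0 && pk_cols[i] < MAX_COLS);
        if (!indexed[pk_cols[i]]) {
            out[n_def++] = pk_cols[i];
        }
    }
    return n_def;
}
```

Numbering the columns a=0 and b=1 with primary key {0}, both {0,1} (idx1) and {1,0} (idx2) come back with length 2, which matches the equal index sizes observed in the test.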

Finally, an example with a composite primary key:

CREATE TABLE t (
  a char(32) not null,
  b char(32) not null,
  c char(32) not null,
  d char(32) not null,
  PRIMARY KEY (a,b),
  KEY idx1 (c,a),
  KEY idx2 (d,b)
) Engine=InnoDB;

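For this table InnoDB fills in the missing primary key columns automatically: idx1 is stored internally as (c,a,b) and idx2 as (d,b,a).
However, the Server layer does not know about these automatically added columns, so the MySQL optimizer is unaware of them. Suppose you run this query: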

SELECT * FROM t WHERE d=x1 AND b=x2 ORDER BY a;

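The internally stored idx2 (d,b,a) could serve both the filter and the ORDER BY of this query straight from the index. But because the Server layer does not know that, the optimizer may end up using idx2 (d,b) for filtering and then sorting on a, or scanning the primary key to avoid the sort.

If we instead declare KEY idx2 (d,b,a) in the table definition, MySQL knows that all three columns are in the index. InnoDB, seeing that the user-defined index already contains every primary key column, appends nothing, so no extra storage is used.

So my sincere advice to all DBAs: when creating an index, append the primary key columns after the columns the business needs. It costs nothing and may bring a pleasant surprise.

I hope that makes sense. I did not expect international readers to need this one, so it was originally published in Chinese only.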

Multiple datafiles per single-table tablespace in InnoDB (InnoDB multiple datafiles per single-tablespace)

P.Linux — Wed, 12 Sep 2012 08:47:26 +0000

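This article is released under a CC license and may be freely reposted, provided the original source, author information and this copyright notice are indicated with a hyperlink.
URL: http://www.penglixun.com/database/innodb_multiple_datafiles_per_single_tablespace.html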

As we know, an Oracle tablespace can consist of many datafiles, so file I/O can be spread across multiple storage paths. MySQL servers are generally not configured with multiple storage paths, but many older filesystems (EXT3, for example) handle I/O on large files poorly, and performance suffers. So keeping MySQL/InnoDB datafiles relatively small is beneficial too.

In shared tablespace mode InnoDB does support multiple datafiles, configured through the innodb_data_file_path option:

innodb_data_file_path = /disk1/ibdata1:2G;/disk2/ibdata2:2G:autoextend

With this setting the datafiles are spread across the two paths disk1 and disk2. The first file has a fixed size of 2GB; the second starts at 2GB and can grow automatically.

But when innodb_file_per_table is enabled, every table gets its own tablespace file, innodb_data_file_path applies only to the system tablespace, and a single-table tablespace cannot use multiple datafiles. Even with one file per table, some tables can still become very large. We have tables of several hundred GB, for example. On XFS that is not much of a problem, but some systems still use EXT3 for "safety", and there the performance of large-file operations is a real concern.

Of course, datafiles can be kept small by splitting databases and tables or by partitioning. But most small companies have no middleware to do the splitting, big tables are everywhere, and with fast-changing business requirements partitioning is not a good fit either. So adding multi-datafile support to single-table tablespaces is a good option.

How can multi-datafile support be added to InnoDB single-table tablespaces with as few changes as possible? After some investigation I found that in most places InnoDB does not treat single-table and shared tablespaces specially, and the tablespace descriptor is the same for both: fil_space_t. Its fil_space_t->chain member is the list of all files belonging to the tablespace, each described by a fil_node_t.

Especially when I saw this comment:

  /* TODO: The following code must change when InnoDB supports
  multiple datafiles per tablespace. */

I believe the InnoDB team already had future support for multiple datafiles per tablespace in mind during development, which made me more confident that this can be implemented.

So I modified and tested the code on the 5.6 source tree. I think the following approach is sound, and I am reorganizing the code along this design:

User interface:

CREATE TABLE gains two new options, DATAFILE_INITIAL_SIZE and DATAFILE_NUM, which give the initial size of each datafile and the number of datafiles.

CREATE TABLE table_name (...) ENGINE=InnoDB 
  DATAFILE_INITIAL_SIZE=1000000, DATAFILE_NUM=100;

This creates 100 datafiles of 1000000 pages each, named "table_name#num.ibd", all in the default data directory. At most 255 files can be initialized this way, and each has a fixed size. To add more files after the table is created, use the ALTER TABLESPACE command.

ALTER TABLESPACE `db_name/table_name` 
  ADD DATAFILE '/diskN/table_name#256' 
  INITIAL_SIZE = 5000 AUTOEXTEND_SIZE=1000 ENGINE=InnoDB;

This command adds a datafile to the table table_name in db_name, located at "/diskN/table_name#256.ibd" (the .ibd suffix is added automatically), with an initial size of 5000 pages, extended by 1000 pages each time it auto-extends.

Design details:

1. Add a data_file_path column to the tables table in I_S, showing the location and size of a table's datafiles in the same notation innodb_data_file_path uses for the shared tablespace.

2. Add a table_name.dbf file in the data directory that persists, for each table, the datafile path information in a format like innodb_data_file_path.

3. Add three fields to the fil_space_t struct, with the same meaning as the InnoDB global variables srv_n_data_files, srv_data_file_names and srv_data_file_sizes, but kept per tablespace: the number of datafiles, their names, and their sizes.

  ulint   n_data_files;    /* Number of datafiles */
  char**  data_file_names; /* Name of each datafile */
  ulint*  data_file_sizes; /* Size of each datafile */

4. Add a global variable srv_ibd_file_initial_size, defaulting to FIL_IBD_FILE_INITIAL_SIZE. If DATAFILE_INITIAL_SIZE is set at table creation and is greater than FIL_IBD_FILE_INITIAL_SIZE, the table is created with srv_ibd_file_initial_size as its initial size. Tables known in advance to grow large can thus be pre-extended, avoiding extension stalls under heavy write load later.

5. Add a function fil_create_new_datafile_for_single_table_tablesapce(), called whenever a new datafile is added. It creates the file with os_file_create(), sets its size with os_file_set_size(), creates a node with fil_node_create() and links it into fil_space_t->chain, and updates the three fields fil_space_t->n_data_files/data_file_names/data_file_sizes.

6. Add a check to open_or_create_data_files() at InnoDB startup: if a table_name.dbf file exists, read the string in it and, reusing the shared-tablespace parsing code, store the parsed result in fil_space_t->n_data_files/data_file_names/data_file_sizes.
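To make step 6 more concrete, here is a minimal C sketch of parsing an innodb_data_file_path-style string into a count, a list of names and a list of sizes. It is not the InnoDB parser and not the patch itself: datafile_spec_t, parse_datafile_spec, the small array limits and the choice of megabytes as the size unit are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_DATA_FILES 8   /* kept small for the sketch */
#define MAX_NAME_LEN   128

/* Hypothetical container mirroring the three proposed fil_space_t fields. */
typedef struct {
    unsigned long n_data_files;
    char          data_file_names[MAX_DATA_FILES][MAX_NAME_LEN];
    unsigned long data_file_sizes[MAX_DATA_FILES]; /* megabytes in this sketch */
} datafile_spec_t;

/* Parses "path:size[:autoextend];path:size..." where size ends in M or G.
   Returns 0 on success and -1 on malformed input. */
int parse_datafile_spec(const char *spec, datafile_spec_t *out)
{
    assert(spec != NULL && out != NULL);
    out->n_data_files = 0;

    const char *p = spec;
    while (*p != '\0') {
        size_t entry_len = strcspn(p, ";");
        const char *colon = memchr(p, ':', entry_len);
        if (colon == NULL || colon == p) return -1;      /* no size or no path */
        size_t name_len = (size_t)(colon - p);
        if (name_len >= MAX_NAME_LEN || out->n_data_files >= MAX_DATA_FILES)
            return -1;

        char *end;
        unsigned long size = strtoul(colon + 1, &end, 10);
        if (end == colon + 1) return -1;                  /* no digits */
        if (*end == 'G') size *= 1024;
        else if (*end != 'M') return -1;
        end++;

        /* The optional attribute is validated but not stored in this sketch. */
        if (strncmp(end, ":autoextend", 11) == 0) end += 11;
        if (end != p + entry_len) return -1;              /* trailing junk */

        unsigned long n = out->n_data_files++;
        memcpy(out->data_file_names[n], p, name_len);
        out->data_file_names[n][name_len] = '\0';
        out->data_file_sizes[n] = size;

        p += entry_len;
        if (*p == ';') p++;
    }
    return out->n_data_files > 0 ? 0 : -1;
}
```

Feeding it the string from the innodb_data_file_path example earlier in this post yields two entries of 2048 MB each, /disk1/ibdata1 and /disk2/ibdata2.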

The final code will be released soon.

A design for implementing flashback in MySQL (MySQL Flashback Feature)

P.Linux — Sun, 09 Sep 2012 05:43:21 +0000

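This article is released under a CC license and may be freely reposted, provided the original source, author information and this copyright notice are indicated with a hyperlink.
URL: http://www.penglixun.com/database/mysql_flashback_feature.html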

Anyone who has used Oracle knows its Flash Recovery Area, into which changed blocks can be written. When data is modified by mistake and needs to be recovered, the blocks stored there can be written back, or the rollback segments can be reconstructed, to restore the data to the desired consistent point.

MySQL/InnoDB does not offer this yet, but many InnoDB designs are modeled on Oracle, so I think InnoDB can implement flashback as well.

My first idea was to imitate Oracle and flash back through the undo log: mark COMMITTED transactions as UNCOMMITTED so that InnoDB treats them as never committed and rolls them back.
The plan in detail:

1. Add a parameter InnoDB_Flashback_Trx_ID to my.cnf, identifying the trx_id whose consistent state we want to roll back to.

2. When InnoDB reads the rollback segments at startup and builds the transactions to roll back, mark every transaction with a trx_id greater than InnoDB_Flashback_Trx_ID as UNCOMMITTED.

3. InnoDB then treats these committed transactions as uncommitted, reconstructs them as such, and rolls them back with its own mechanism before the database is opened.

This approach has obvious drawbacks. It works only for InnoDB, and a flashback requires a restart. Worse, while implementing and testing it I found that if a DDL statement has run and you flash back to a TRX_ID before that DDL, InnoDB crashes and cannot start again. The datafiles are presumably corrupted, because InnoDB undo records are logical, not physical.

That led to a second approach based on the binlog. A ROW-format binlog records complete information for every row: an INSERT contains the value of every column, so does a DELETE, and an UPDATE contains all column values in both its SET and WHERE parts. The binlog is therefore a complete logical redo log, and reversing its operations yields exactly the "undo" we need.
The plan in detail:

1. Change the output of Row_log_event's print so that the Event_type is reversed: WRITE_ROWS_EVENT becomes DELETE_ROWS_EVENT and DELETE_ROWS_EVENT becomes WRITE_ROWS_EVENT. Only one byte has to change, the one at offset 4, ptr[4].

2. For UPDATE_ROWS_EVENT the SET and WHERE parts must be swapped. This is the only slightly troublesome part; I added an exchange_update_rows function for it. It uses print_verbose_one_row to work out the lengths of the SET and WHERE parts, derives the split point between them, and swaps the two with memcpy.

3. Once the events themselves are reversed, their order must be reversed too, so I intercept the output in memory. I modified the Write_on_release_cache class and added a buffer to Log_event into which an event's print output can be written, so mysqlbinlog can capture the output of each event and keep it in memory.

4. In mysqlbinlog I store the output of all events in a DYNAMIC_ARRAY and then print them from the last to the first. The result is a file of inverse operations; importing it into the target database completes the flashback.
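The four steps can be sketched on a simplified in-memory model. This is an illustration only, not the mysqlbinlog patch: the row_event_t struct, the EV_* names and both functions are hypothetical, and real binlog events are packed byte buffers rather than pairs of strings.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-ins for the binlog row event types. */
typedef enum { EV_WRITE_ROWS, EV_UPDATE_ROWS, EV_DELETE_ROWS } ev_type_t;

/* Hypothetical in-memory row event. */
typedef struct {
    ev_type_t   type;
    const char *before; /* row image before the change (WHERE part), NULL for inserts */
    const char *after;  /* row image after the change (SET part), NULL for deletes */
} row_event_t;

/* Steps 1 and 2: turn one event into its inverse. Swapping the two images
   also moves the single image of an insert or delete to the other side. */
static row_event_t invert_event(row_event_t ev)
{
    const char *tmp = ev.before;
    ev.before = ev.after;
    ev.after = tmp;
    if (ev.type == EV_WRITE_ROWS) {
        ev.type = EV_DELETE_ROWS;
    } else if (ev.type == EV_DELETE_ROWS) {
        ev.type = EV_WRITE_ROWS;
    }
    return ev;
}

/* Steps 3 and 4: emit the inverted events from last to first. */
void build_flashback(const row_event_t *in, size_t n, row_event_t *out)
{
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < n; i++) {
        out[i] = invert_event(in[n - 1 - i]);
    }
}
```

For a log that inserts a row, updates it and then deletes it, the flashback sequence re-inserts the deleted row, reverts the update and finally deletes the row that was inserted first.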

The advantage is clear: it works for every storage engine, because the binlog belongs to the Server layer. In addition, the existing mysqlbinlog filters (start-position, start-datetime and so on) can be used to select which part of the log is turned into a rollback log, so you can flash back a specific range of operations, a single database, a particular time window, and so on.

The patch is here: http://mysql.taobao.org/index.php/Patch_source_code#Add_flashback_feature_for_mysqlbinlog

A problem from Baidu AStar 2008: idiom correction

P.Linux — Sun, 17 Jun 2012 07:40:01 +0000

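This article is released under a CC license and may be freely reposted, provided the original source, author information and this copyright notice are indicated with a hyperlink.
URL: http://www.penglixun.com/tech/program/%e7%99%be%e5%ba%a6astar2008%e7%9a%84%e4%b8%80%e9%81%93%e9%a2%98%ef%bc%9a%e6%88%90%e8%af%ad%e7%ba%a0%e9%94%99.html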

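A friend happened to ask me about this problem. I passed it with full marks at AStar back then, so here it is for reference; there is nothing technically deep in it.

Problem background
Chinese idioms (chengyu) are a cultural treasure of the Chinese nation: a snapshot of history, a crystallization of wisdom and the essence of the Chinese language.
Your task is to correct a misspelled four-character idiom and find its proper form. Specifically, you may change only one of the four characters so that the result appears in a given idiom list. The original, misspelled idiom is guaranteed not to appear in the list.

Sometimes such a correction is not unique. For example, "一糯千金" could be changed to "一字千金" or to "一诺千金". But since "糯" and "诺" are homophones, "一糯千金" is more likely to be a misspelling of "一诺千金".
We therefore also provide a character classification table and require that the character before and after the change belong to the same category.
Under this restriction the correction is guaranteed to be unique.
Notes
1. All Chinese characters are GBK-encoded (see the FAQ).
2. Each category contains at least two characters, and one character may appear in several categories.
3. All idioms in the list are real four-character idioms. Every character in the idiom list and in the idiom to be corrected appears in at least one category of the table.
Input format
The first line contains two integers n and m (1<=n<=200, 1<=m<=20000): n is the number of character categories and m is the number of idioms. Each of the next n lines is a string of Chinese characters with no whitespace (spaces or TABs) listing all characters in one category; a string may contain up to 200 characters. The next m lines are the idiom list, one idiom of exactly four characters per line. The last line is the idiom to be corrected, exactly four characters, which does not appear in the list.
Output format
A single line containing a four-character idiom. Under the restriction that the change must stay within one category, the input guarantees a unique result.
Sample input
7 3
糯诺挪喏懦
字自子紫籽
前钱千牵浅
进近今仅紧金斤尽劲
完万
水睡税
山闪衫善扇杉
一诺千金
一字千金
万水千山
一糯千金
Sample output
一诺千金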

#include <iostream>
#include <string>
#include <vector>
using namespace std;

int hashkey(char ch1,char ch2)
{
    return ((unsigned char)ch1-129)*190 + ((unsigned char)ch2-64) - (unsigned char)ch2/128;
}

int checksame(string str1,string str2,string &strn,string &strm)
{
    int count = 0;
    for(int i = 0; i < 4; ++i)
        if((str1[i*2]==str2[i*2])&&(str1[i*2+1]==str2[i*2+1]))
            ++count;
            else{
                strn.resize(2);
                strm.resize(2);
                strn[0] = str1[i*2];
                strn[1] = str1[i*2+1];
                strm[0] = str2[i*2];
                strm[1] = str2[i*2+1];
            }
    //cout << "Check:" << str1 << "," << str2
    //<< ":" << count
    //<< "--" << strn[0] << strn[1]
    //<< "|" << strm[0] << strm[1] << endl;
    return count;
}

int main()
{
    int n, m, count, GBKindex;
    vector< vector<int> > hashn;
    vector< vector < vector<int> > > hashm;
    string str,tmp1,tmp2;
    string kind[200],word[20000];

    cin >> n >> m;

    hashn.resize(25000);
    for(int i = 0; i < 25000; ++i)
        hashn[i].resize(1,0);
    for(int i = 0; i < n; ++i){
        cin >> kind[i];
        //cout << kind[i].size() << endl;
        for(int j = 0; j < (kind[i].size()/2); ++j){
            GBKindex = hashkey(kind[i][j*2],kind[i][j*2+1]);
            count = ++hashn[GBKindex][0];
            hashn[GBKindex].resize(count+1);
            hashn[GBKindex][count] = i;
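            //cout << GBKindex << ":" << hashn[GBKindex][0] << ":"
            //<< i << ":" << kind[i][j*2] << kind[i][j*2+1] << endl;
        }
    }

    hashm.resize(4);
    for(int j = 0; j < 4; ++j){
        hashm[j].resize(25000);
        for(int k = 0; k < 25000; ++k)
            hashm[j][k].resize(1,0);
    }
    for(int i = 0; i < m; ++i){
        cin >> word[i];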
        for(int j = 0; j < 4; ++j){
            GBKindex = hashkey(word[i][j*2],word[i][j*2+1]);
            count = ++hashm[j][GBKindex][0];
            hashm[j][GBKindex].resize(count+1);
            hashm[j][GBKindex][count] = i;
            //cout << GBKindex << ":" << hashm[j][GBKindex][0] << ":"
            //<< i << ":" << word[i][j*2] << word[i][j*2+1] << endl;
        }
    }

    cin >> str;

    for(int i = 0; i < 4; ++i ){
        GBKindex = hashkey(str[i*2],str[i*2+1]);
        //cout << GBKindex <
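
The core check behind the problem can be sketched in C as a plain linear scan over the idiom list. This is a simplified illustration and not the submission above, which builds hash tables keyed by GBK code: single_diff_pos, in_category and correct_idiom are hypothetical names, and every character is assumed to occupy exactly two bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CHAR_BYTES  2 /* GBK uses two bytes per Chinese character */
#define IDIOM_CHARS 4

/* Returns the position (0..3) of the only differing character, or -1 when
   the two idioms differ in zero or in more than one character. */
int single_diff_pos(const char *a, const char *b)
{
    int pos = -1;
    for (int i = 0; i < IDIOM_CHARS; i++) {
        if (memcmp(a + i * CHAR_BYTES, b + i * CHAR_BYTES, CHAR_BYTES) != 0) {
            if (pos != -1) return -1;
            pos = i;
        }
    }
    return pos;
}

/* Returns 1 if the two-byte character c occurs in category string cat. */
static int in_category(const char *c, const char *cat)
{
    size_t len = strlen(cat);
    for (size_t k = 0; k + CHAR_BYTES <= len; k += CHAR_BYTES) {
        if (memcmp(cat + k, c, CHAR_BYTES) == 0) return 1;
    }
    return 0;
}

/* Returns the corrected idiom from idioms[], or NULL if none qualifies. */
const char *correct_idiom(const char *wrong,
                          const char *const *idioms, size_t m,
                          const char *const *cats, size_t n)
{
    assert(strlen(wrong) == IDIOM_CHARS * CHAR_BYTES);
    for (size_t i = 0; i < m; i++) {
        int pos = single_diff_pos(wrong, idioms[i]);
        if (pos < 0) continue;
        for (size_t c = 0; c < n; c++) {
            if (in_category(wrong + pos * CHAR_BYTES, cats[c]) &&
                in_category(idioms[i] + pos * CHAR_BYTES, cats[c])) {
                return idioms[i];
            }
        }
    }
    return NULL;
}
```

The scan costs O(m * n * category length) per query, which is fine for the stated limits; the hashing in the original trades memory for faster lookups.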

Adding per-thread memory monitoring to MySQL (MySQL Thread Memory Usage Monitor)

P.Linux — Fri, 27 Apr 2012 13:55:13 +0000

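This article is released under a CC license and may be freely reposted, provided the original source, author information and this copyright notice are indicated with a hyperlink.
URL: http://www.penglixun.com/database/mysql_memory_usage_monitor.html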

So that international readers can follow along, from now on my blog posts will be provided in both Chinese and English. :)

When running MySQL I often see mysqld's memory usage grow very quickly (the Buffer Pool is allocated from large pages), to the point that it starts using swap. There is no mechanism to monitor how much memory the Server layer uses. So as a first step I wrote a patch (based on 5.6.6) that tracks how much memory each thread uses. When the mysqld process uses too much memory, I can see which threads use the most and kill them.

After applying the patch, the result looks like this:

The code is in the patch:
Note: There is a file embedded within this post, please visit this post to download the file.

The basic method is to add callback functions to my_malloc and my_free (an idea from 丁奇 (Xiaobin Lin) of Taobao, brilliant) that obtain the THD descriptor of the thread calling my_malloc or my_free, and to use a new malloc_size field in THD to record memory allocated and freed. my_realloc should update malloc_size too; that is not in this version yet.

Then malloc_usable_size is used at free time to find out how much memory was allocated for the pointer. With GCC 4.2 or later on platforms whose C library provides it (macOS, for example), malloc_size(pointer) can be used instead.
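The accounting idea can be sketched outside MySQL as follows. This is not the patch: it swaps the THD field and the my_malloc/my_free callbacks for a plain C11 thread-local counter, the tracked_* names are hypothetical, and it assumes glibc, where malloc_usable_size is declared in malloc.h.

```c
#include <assert.h>
#include <malloc.h> /* malloc_usable_size(): glibc extension */
#include <stdlib.h>

/* Bytes currently attributed to the calling thread; a stand-in for the
   malloc_size field that the patch adds to THD. */
static _Thread_local size_t thread_malloc_size = 0;

/* Allocate and charge the usable size of the block to this thread. */
void *tracked_malloc(size_t size)
{
    void *p = malloc(size);
    if (p != NULL) {
        thread_malloc_size += malloc_usable_size(p);
    }
    return p;
}

/* Look up the block size at free time, as the post describes, and
   subtract it from this thread's counter. */
void tracked_free(void *p)
{
    if (p == NULL) return;
    size_t usable = malloc_usable_size(p);
    assert(thread_malloc_size >= usable);
    thread_malloc_size -= usable;
    free(p);
}

/* Current number of bytes charged to the calling thread. */
size_t tracked_usage(void)
{
    return thread_malloc_size;
}
```

Because the same usable size is added and subtracted, the counter returns to its starting value once every block is freed by the thread that allocated it. A block freed by a different thread would skew both counters, which a real implementation has to handle.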

As a next step I will break the monitoring down by category, reporting how much each thread uses for sort_buffer, join_buffer, net_buffer and other thread-level memory, so the monitoring is more intuitive.

Here is the new patch, which also accounts for memory reallocated by my_realloc:
Note: There is a file embedded within this post, please visit this post to download the file. (based on mysql-5.6.6)
Note: There is a file embedded within this post, please visit this post to download the file. (based on percona-5.5.22)

Implementing and testing a skip list

P.Linux — Tue, 20 Mar 2012 09:35:26 +0000

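This article is released under a CC license and may be freely reposted, provided the original source, author information and this copyright notice are indicated with a hyperlink.
URL: http://www.penglixun.com/tech/program/skip_list_leveldb.html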

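One of the core data structures in LevelDB is the skip list. It resembles a singly linked list but adds several layers of pointers for skipping ahead, which gives it efficiency close to a balanced tree, while the code is far simpler than an AVL tree or other balanced binary trees. So even though the skip list is not the best algorithm in theory, its simple implementation makes it practical in many places.
Below is the structure of a skip list.

Here are the implementation and the test code. It is very simple, far simpler than a balanced tree.
I found some memory leaks and out-of-bounds accesses and have added fixes.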

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include <time.h>

#define TIME(A,B) (double)(B-A)/CLOCKS_PER_SEC*1000

/* Skip List Level Limit, Level begins with 0 */
#define MAX_SKIP_LEVEL 6
/* Level Weight for Calc node level */
#define LEVEL_W 6

/* Skip List Struct */
typedef struct skip_list_struct skip_list_t;
struct skip_list_struct {
  int key;
  int value;
  int level;
  /* A node of level i uses forward[0..i], i.e. (i+1) pointers */
  skip_list_t* forward[MAX_SKIP_LEVEL];
};

/* Rand Node Level  */
int rand_level() 
{
  int level= 0;
  int i= 0;
  /* The higher the level, the fewer nodes reach it */
  for(; i< MAX_SKIP_LEVEL-1; ++i)
//for(; i< MAX_SKIP_LEVEL; ++i) wrong: this could generate a level beyond the valid range
  {
    level+= rand()%10> LEVEL_W? 1: 0;
  }

  return level;
}

/* Make a new node and init it */
skip_list_t* init_skip_list_node (int level, int key, int value) 
{
  skip_list_t* node= (skip_list_t *)malloc(sizeof(skip_list_t));
  node->level= level;
  node->key= key;
  node->value= value;
  int i= 0;
  for (; i< MAX_SKIP_LEVEL; ++i)
  {
    node->forward[i]= NULL;
  }

  return node;
}

/* Insert or Update a value on Skip List */
int skip_list_write (skip_list_t* skip_list, int key, int value)
{
  skip_list_t* update_node[MAX_SKIP_LEVEL];
  skip_list_t* node= skip_list;
  int i= skip_list->level;
  for(; i>=0; --i) 
  {
    while (node->forward[i]!= NULL &&
           key> node->forward[i]->key)
    { 
      node= node->forward[i];
    }
    update_node[i]= node;
  }
  node= node->forward[0]== NULL? node: node->forward[0];
  if (key== node->key) 
  {
    node->value= value;
  } 
  else 
  {
    int level= rand_level();
    node= init_skip_list_node(level, key, value);
    for (i= 0; i<= level; ++i)
    {
      node->forward[i]= update_node[i]->forward[i];
      update_node[i]->forward[i]= node;
    }
  }
  return 0;
}

/* Delete a node on Skip List */
int skip_list_delete (skip_list_t* skip_list, int key)
{
  skip_list_t* update_node[MAX_SKIP_LEVEL];
  skip_list_t* node= skip_list;
  int i= skip_list->level;
  for(; i>=0; --i) 
  {
    while (node->forward[i]!= NULL &&
           key> node->forward[i]->key) 
    {
      node= node->forward[i];
    }
    update_node[i]= node;
  }
  node= node->forward[0]== NULL? node: node->forward[0];
  if (key== node->key) 
  {
    for (i= 0; i<= skip_list->level; ++i) 
    {
      if (update_node[i]->forward[i] != node)
      {
        break;
      }
      update_node[i]->forward[i]= node->forward[i];
    }
    free(node);
    return 0; // SUCCESS
  } 
  else 
  {
    return 1; // NO FOUND
  }
}

/* Search a key from Skip List */
int skip_list_search (skip_list_t* skip_list, int key)
{
  skip_list_t* node= skip_list;
  int level= node->level;
  int i= level;
  for (; i>= 0; --i) 
  {
    while (node->forward[i]!= NULL &&
           key> node->forward[i]->key) 
    {
      node= node->forward[i];
    }
  }
  node= node->forward[0]== NULL? node: node->forward[0];
  if (key== node->key)
  {
    return node->value;
  }
  else
  { 
    return INT_MIN;
  }
}

int print_skip_list (skip_list_t* skip_list)
{
  skip_list_t* node;
  int i= 0;
  for(; i< MAX_SKIP_LEVEL; ++i) 
  {
    node= skip_list->forward[i];
    printf("Level[%d]: ", i);
    while(node!= NULL) 
    {
      printf("%d -> ", node->key);
      node= node->forward[i];
    }
    printf("NULL\n");
  }
  return 0;
}

// Added a function that frees memory, to avoid leaks
/* Free All Nodes */
int free_skip_list (skip_list_t* skip_list) {
  skip_list_t* node= skip_list->forward[0];
  skip_list_t* next_node;
  while (node!= NULL)
  {
    next_node= node->forward[0];
    free(node);
    node= next_node;
  }
  free(skip_list);
  return 0;
}

int main(int argc, char *argv[])
{
  srand((unsigned)time(0));
  int count= 0;
  int i= 0;

  /* Function Test */
  printf("#### Function Test ####\n");

  count= 20;
  printf("== Init Skip List ==\n");

  skip_list_t* skip_list= init_skip_list_node(MAX_SKIP_LEVEL-1, INT_MIN, INT_MIN);
//  skip_list_t* skip_list= init_skip_list_node(MAX_SKIP_LEVEL, INT_MIN, INT_MIN); one level too many, which caused the out-of-bounds access
  for (i= 0; i
These are the test results:
#### Function Test ####

== Init Skip List ==

== Print Skip List ==

Level[0]: 0 -> 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7 -> 8 -> 9 -> 10 -> 11 -> 12 -> 13 -> 14 -> 15 -> 16 -> 17 -> 18 -> 19 -> NULL

Level[1]: 1 -> 2 -> 3 -> 4 -> 5 -> 6 -> 7 -> 8 -> 9 -> 10 -> 13 -> 15 -> 16 -> 17 -> 18 -> 19 -> NULL

Level[2]: 1 -> 2 -> 3 -> 5 -> 6 -> 7 -> 8 -> 9 -> 10 -> 13 -> 15 -> 16 -> 17 -> 18 -> NULL

Level[3]: 1 -> 2 -> 6 -> 7 -> 8 -> 10 -> 13 -> 15 -> NULL

Level[4]: 1 -> 2 -> 7 -> 8 -> 10 -> NULL

Level[5]: NULL

== Search Key ==

Search [4]: 4

Search [13]: 13

Search [9]: 9

Search [23]: -2147483648

Search [8]: 8

Search [7]: 7

Search [22]: -2147483648

Search [5]: 5

Search [3]: 3

Search [15]: 15

Search [13]: 13

Search [15]: 15

Search [6]: 6

Search [22]: -2147483648

Search [7]: 7

Search [13]: 13

Search [14]: 14

Search [2]: 2

Search [6]: 6

Search [17]: 17

== Delete Key ==

Delete [15]: SUCCESS

Delete [5]: SUCCESS

Delete [13]: SUCCESS

Delete [10]: SUCCESS

Delete [20]: NO FOUND

Delete [11]: SUCCESS

Delete [14]: SUCCESS

Delete [16]: SUCCESS

Delete [10]: NO FOUND

Delete [16]: NO FOUND

Delete [9]: SUCCESS

Delete [23]: NO FOUND

Delete [6]: SUCCESS

Delete [8]: SUCCESS

Delete [21]: NO FOUND

Delete [4]: SUCCESS

Delete [9]: NO FOUND

Delete [22]: NO FOUND

Delete [20]: NO FOUND

Delete [1]: SUCCESS

== Print Skip List ==

Level[0]: 0 -> 2 -> 3 -> 7 -> 12 -> 17 -> 18 -> 19 -> NULL

Level[1]: 2 -> 3 -> 7 -> 17 -> 18 -> 19 -> NULL

Level[2]: 2 -> 3 -> 7 -> 17 -> 18 -> NULL

Level[3]: 2 -> 7 -> NULL

Level[4]: 2 -> 7 -> NULL

Level[5]: NULL

#### Performance Test ####

== Insert 10^5 Items (6 Level) ==

Time: 1196.923950 ms, Speed: 83547.500000 Node/s

== Search 10^5 Items (6 Level) ==

Time: 1196.629028 ms, Speed: 83568.085938 Node/s

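Adjusting the two constants MAX_SKIP_LEVEL and LEVEL_W changes Speed noticeably. Implementing it yourself gives a much deeper understanding than reading code many times.
The fixed code has no leaks and no out-of-bounds accesses: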
==1631==

==1631== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 5 from 1)

==1631== malloc/free: in use at exit: 0 bytes in 0 blocks.

==1631== malloc/free: 100,010 allocs, 100,010 frees, 6,400,640 bytes allocated.

==1631== For counts of detected errors, rerun with: -v

==1631== All heap blocks were freed -- no leaks are possible.
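
When experimenting with MAX_SKIP_LEVEL and LEVEL_W, it may help to compare against the more common way of drawing node levels. rand_level() above sums MAX_SKIP_LEVEL-1 independent trials, so its levels follow a binomial distribution. Skip lists are more often described with a geometric distribution, where each extra level is kept with a fixed probability. The sketch below shows that alternative; geometric_level and its branching parameter are hypothetical names, not part of the code above.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical alternative to rand_level(): returns a level in
   [0, max_level-1]. Each additional level is kept with probability
   1/branching, so higher levels become geometrically rarer. */
int geometric_level(int max_level, int branching)
{
    assert(max_level > 0 && branching > 1);
    int level = 0;
    while (level < max_level - 1 && rand() % branching == 0) {
        ++level;
    }
    return level;
}
```

With branching set to 4, roughly one node in four reaches level 1, one in sixteen level 2, and so on. Replacing the call to rand_level() with geometric_level(MAX_SKIP_LEVEL, 4) in skip_list_write would be one way to measure the effect on Speed.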

Several ways to write breadth-first search

P.Linux — Mon, 19 Mar 2012 14:26:11 +0000

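This article is released under a CC license and may be freely reposted, provided the original source, author information and this copyright notice are indicated with a hyperlink.
URL: http://www.penglixun.com/tech/program/bfs_programing.html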

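A few days ago someone asked me how to write BFS recursively, and out of habit I thought of DFS instead. I had some free time tonight, so I practised: write more code to keep the brain from rusting.
I wrote it three ways: with recursion, with a queue, and with multiple threads.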

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <pthread.h>

/* For @RETURN */
#define TREE_FULL 2
#define TREE_ERROR 1
#define TREE_SUCCESS 0

/* For Tree */
#define TREE_LEVEL 4
#define IS_LEAF_NODE(A) ((A)->left_node== NULL || (A)->right_node== NULL)

/* For QUEUE */
#define QUEUE_LEN     64
#define QUEUE_INIT    (queue_front=0, queue_rear=0, memset(queue,0,QUEUE_LEN*sizeof(tree_t*)))
#define QUEUE_PUT(A)  queue[queue_front++ % QUEUE_LEN]= A
#define QUEUE_POP     queue[queue_rear++ % QUEUE_LEN]
#define QUEUE_FULL    ((queue_front-queue_rear)>= QUEUE_LEN-1)
#define QUEUE_EMPTY   (queue_front== queue_rear)

#define QUEUE_PUT_PTHREAD(A) pthread_mutex_lock(&queue_mtx);QUEUE_PUT(A);pthread_mutex_unlock(&queue_mtx);
#define QUEUE_POP_PTHREAD(A) pthread_mutex_lock(&queue_mtx);A= QUEUE_POP;pthread_mutex_unlock(&queue_mtx);

/* Binary Tree Struct */
typedef struct tree_struct tree_t;
struct tree_struct {
  int value;
  int level;
  tree_t* left_node;
  tree_t* right_node;
};

/* FIFO Queue */
tree_t* queue[QUEUE_LEN];
int queue_front= 0;
int queue_rear= 0;

/* Multi-Thread */
pthread_mutex_t queue_mtx= PTHREAD_MUTEX_INITIALIZER;  

/* Initialize Binary Tree using Recursion */
int recursion_init_node (tree_t *node, int level)
{
  if (node== NULL) 
    return TREE_ERROR;

  node->value= rand()%10;
  node->level= level;
  printf("Level: %d, Value: %d\n", level, node->value);
  if (level>= TREE_LEVEL) 
  {
    node->left_node= NULL;
    node->right_node= NULL;
    return TREE_FULL;
  }
  node->left_node= (tree_t *)malloc(sizeof(tree_t));
  node->right_node= (tree_t *)malloc(sizeof(tree_t));

  ++level;
  recursion_init_node(node->left_node, level);
  recursion_init_node(node->right_node, level);

  return TREE_SUCCESS;
}

/* Visit Tree of BFS using Recursion */
int recursion_visit_tree (int level) 
{
  if (level> TREE_LEVEL)
    return TREE_FULL;
  printf("Level%d: ", level);

  int i= 0;
  tree_t* tmp[QUEUE_LEN];
  while (!QUEUE_EMPTY) {
    tree_t* node= QUEUE_POP;
    printf("| Value: %d |", node->value);

    if (IS_LEAF_NODE(node))
      continue;

    tmp[i++]= node->left_node;
    tmp[i++]= node->right_node;
  }
  printf("\n");

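  int j;
  for(j =0; j< i; ++j)
    QUEUE_PUT(tmp[j]);

  return recursion_visit_tree(level+1);
}

/* Visit Tree of BFS using QUEUE */
int queue_visit_tree (tree_t* root_node)
{
  tree_t* node;
  int level= 0;

  QUEUE_PUT(root_node);
  printf("Level%d: ", level);
  while (!QUEUE_EMPTY) {
    node= QUEUE_POP;
    if (node->level> level) {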
      level= node->level;
      printf("\nLevel%d: ", level);
    }
    printf("| Value: %d |", node->value);

    if (IS_LEAF_NODE(node)) 
      continue;
    QUEUE_PUT(node->left_node);
    QUEUE_PUT(node->right_node);
  }
  printf("\n");
  return TREE_SUCCESS;
}

/* Visit Tree of BFS using Multi-Thread */
void* visit_node_func(void* node)
{
  tree_t* tree_node= (tree_t *)node;
  printf("| Value: %d |", tree_node->value);
  if (!IS_LEAF_NODE(tree_node)) {
    QUEUE_PUT_PTHREAD(tree_node->left_node);
    QUEUE_PUT_PTHREAD(tree_node->right_node);
  }
  return NULL;
}

int concurrency_visit_tree(tree_t* root_node) 
{
  pthread_t tid[QUEUE_LEN];
  tree_t* node;
  int level= 0;

  QUEUE_PUT(root_node);
  while (!QUEUE_EMPTY) {
    printf("Level%d: ", level++);

    pthread_mutex_lock(&queue_mtx);
    int count= 0;
    while (!QUEUE_EMPTY) {
      node= QUEUE_POP;
      pthread_create(&tid[count++], NULL, visit_node_func, (void *)node);
    }
    pthread_mutex_unlock(&queue_mtx);

    int i= 0;
    for(i=0; i< count; ++i)
      pthread_join(tid[i], NULL);

    printf("\n");
  }
  return TREE_SUCCESS;
}

/* Main */
int main(int argc, char *argv[])
{
  srand((unsigned)time(0));

  printf("== Init Tree ==\n");
  tree_t* tree= (tree_t *)malloc(sizeof(tree_t));
  recursion_init_node(tree, 0);

  printf("== Recursion Visit BFS Tree ==\n");
  QUEUE_INIT;
  QUEUE_PUT(tree); 
  recursion_visit_tree(0);

  printf("== Queue Visit BFS Tree ==\n");
  QUEUE_INIT;
  queue_visit_tree(tree);

  printf("== Multi-Thread Visit BFS ==\n");
  QUEUE_INIT;
  concurrency_visit_tree(tree);
}

== Init Tree ==
Level: 0, Value: 0
Level: 1, Value: 9
Level: 2, Value: 1
Level: 3, Value: 7
Level: 4, Value: 7
Level: 4, Value: 4
Level: 3, Value: 0
Level: 4, Value: 9
Level: 4, Value: 5
Level: 2, Value: 1
Level: 3, Value: 6
Level: 4, Value: 9
Level: 4, Value: 5
Level: 3, Value: 3
Level: 4, Value: 9
Level: 4, Value: 3
Level: 1, Value: 9
Level: 2, Value: 9
Level: 3, Value: 1
Level: 4, Value: 2
Level: 4, Value: 1
Level: 3, Value: 6
Level: 4, Value: 2
Level: 4, Value: 4
Level: 2, Value: 0
Level: 3, Value: 2
Level: 4, Value: 5
Level: 4, Value: 5
Level: 3, Value: 3
Level: 4, Value: 2
Level: 4, Value: 2
== Recursion Visit BFS Tree ==
Level0: | Value: 0 |
Level1: | Value: 9 || Value: 9 |
Level2: | Value: 1 || Value: 1 || Value: 9 || Value: 0 |
Level3: | Value: 7 || Value: 0 || Value: 6 || Value: 3 || Value: 1 || Value: 6 || Value: 2 || Value: 3 |
Level4: | Value: 7 || Value: 4 || Value: 9 || Value: 5 || Value: 9 || Value: 5 || Value: 9 || Value: 3 || Value: 2 || Value: 1 || Value: 2 || Value: 4 || Value: 5 || Value: 5 || Value: 2 || Value: 2 |
== Queue Visit BFS Tree ==
Level0: | Value: 0 |
Level1: | Value: 9 || Value: 9 |
Level2: | Value: 1 || Value: 1 || Value: 9 || Value: 0 |
Level3: | Value: 7 || Value: 0 || Value: 6 || Value: 3 || Value: 1 || Value: 6 || Value: 2 || Value: 3 |
Level4: | Value: 7 || Value: 4 || Value: 9 || Value: 5 || Value: 9 || Value: 5 || Value: 9 || Value: 3 || Value: 2 || Value: 1 || Value: 2 || Value: 4 || Value: 5 || Value: 5 || Value: 2 || Value: 2 |
== Multi-Thread Visit BFS ==
Level0: | Value: 0 |
Level1: | Value: 9 || Value: 9 |
Level2: | Value: 1 || Value: 1 || Value: 9 || Value: 0 |
Level3: | Value: 7 || Value: 6 || Value: 1 || Value: 6 || Value: 2 || Value: 3 || Value: 3 || Value: 0 |
Level4: | Value: 7 || Value: 4 || Value: 9 || Value: 5 || Value: 2 || Value: 1 || Value: 2 || Value: 4 || Value: 5 || Value: 5 || Value: 2 || Value: 2 || Value: 9 || Value: 3 || Value: 9 || Value: 5 |

DIY Multi-Master Replication

P.Linux — Tue, 14 Feb 2012 11:57:50 +0000

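This article is licensed under a CC license. It may be freely reposted, provided that the original source, the author information and this copyright notice are given as a hyperlink.
URL: http://www.penglixun.com/database/diy_multi_master_replication.html

First published at: http://www.mysqlops.com/2012/02/14/diy_multi_master_replication.html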

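To this day, a MySQL Slave can still replicate from only one Master. Topologies such as one master with several slaves (M->S) or dual-master replication (M<->M) are possible, but the limitation remains severe.
For example, we recently needed to build an online delayed backup for a production cluster, that is, to hang one more set of Slaves off the production dual-master cluster in case both the primary and the standby of an important cluster go down. With MySQL's current architecture, such an online backup needs one instance per source: if 128 instances are serving production traffic, I need 128 more instances to replicate from them, and the management cost is enormous.
We previously had a solution based on Perl scripts, described in an earlier post. Its biggest problem is that it is inconvenient to manage: there is nothing to monitor, the script cannot be stopped at will, and so on. Filling in those parts would take so much code that it would amount to reimplementing MySQL Replication, so it is better to reuse MySQL's own management code and implement multi-master inside MySQL.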

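Reading the source code shows that MySQL manages each replication channel through a Master_info class (defined in sql/rpl_mi.h). The functions start_slave/change_master/stop_slave/show_slave/end_slave all take a Master_info pointer, which makes the multi-master change much easier: basically we only need to pass in the right Master_info for each replication channel.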

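Besides finding the function entry points, the grammar also has to support multiple masters; otherwise the CHANGE MASTER TO statement cannot address more than one. I modified sql_yacc.yy to support the following syntax:
CHANGE MASTER 'channel_id' TO, START SLAVE 'channel_id', STOP SLAVE 'channel_id', SHOW SLAVE 'channel_id' STATUS.
With this, the SQL syntax supports multiple masters.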

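Another question is how to store the information for multiple channels. In the default single-channel case, master.info stores the Master information and relay-log.info stores the apply position. The file names therefore have to change as well: I append the channel ID as a suffix to master.info and relay-log.info, so a channel named "plx" is stored as master.info.plx and relay-log.info.plx. Relay logs carry a sequence number, so "-channel_id" is inserted before the sequence number.
A further issue is that every command identifies a channel by its channel ID, so the names of the channels in use must be persisted, and once a channel exists its Master_info must be retrievable by name. I therefore added a MASTER_INFO_INDEX class (in sql/rpl_mi.h) that holds a hash table mapping channel IDs to Master_info pointers, plus the IO_CACHE needed for persistence; the existing channel IDs are stored in the file master.info.index.
Example file names: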

-rw-rw---- 1 mysql mysql 10 Feb 13 20:40 master.info.index
-rw-rw---- 1 mysql mysql 76 Feb 14 17:27 master.info.plx1
-rw-rw---- 1 mysql mysql 71 Feb 14 17:27 master.info.plx2
-rw-rw---- 1 mysql mysql 90 Feb 14 17:25 relay-log.info.plx1
-rw-rw---- 1 mysql mysql 90 Feb 14 17:27 relay-log.info.plx2

-rw-rw---- 1 mysql mysql 160 Feb 14 10:16 mysql-relay-bin-plx1.000011
-rw-rw---- 1 mysql mysql 83765425 Feb 14 17:27 mysql-relay-bin-plx1.000012
-rw-rw---- 1 mysql mysql 106 Feb 14 10:16 mysql-relay-bin-plx1.index
-rw-rw---- 1 mysql mysql 160 Feb 14 10:16 mysql-relay-bin-plx2.000014
-rw-rw---- 1 mysql mysql 83455792 Feb 14 17:27 mysql-relay-bin-plx2.000015
-rw-rw---- 1 mysql mysql 106 Feb 14 10:16 mysql-relay-bin-plx2.index
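
The naming scheme in the listing can be sketched in a few lines of C. This is only an illustration of the convention, not code from the patch: the helper names channel_info_name and channel_relay_log_name are made up here, and the six-digit sequence width is read off the file names shown.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: builds "<base>.<channel>", e.g. master.info.plx1.
   Returns 0 on success, -1 if the name does not fit in buf. */
int channel_info_name(char *buf, size_t len,
                      const char *base, const char *channel)
{
  assert(buf != NULL && base != NULL && channel != NULL);
  int n = snprintf(buf, len, "%s.%s", base, channel);
  return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Hypothetical helper: builds "<base>-<channel>.<seq>" with a six-digit
   sequence number, e.g. mysql-relay-bin-plx1.000011. */
int channel_relay_log_name(char *buf, size_t len, const char *base,
                           const char *channel, unsigned long seq)
{
  assert(buf != NULL && base != NULL && channel != NULL);
  int n = snprintf(buf, len, "%s-%s.%06lu", base, channel, seq);
  return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Putting the channel ID before the sequence number keeps each channel's relay logs sortable by sequence within their own name prefix.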

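Download the patch here: http://bugs.mysql.com/file.php?id=18020

What can we do once we have multi-master? Here are two use cases.
The first is one standby backing up many instances. Because of our sharding strategy, one cluster consists of many instances, each holding a few Schemas that never overlap. For example, the first instance holds Schemas 1 to 3 and the second holds Schemas 4 to 6, so applying their binlogs on one server causes no data conflicts. This is the online backup scheme we are testing.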

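The second is cross-datacenter HA. For disaster recovery or lower latency, many companies deploy databases in different datacenters, which brings in data synchronization. To keep the data produced in each datacenter from conflicting, we normally use the two parameters auto_increment_increment and auto_increment_offset, which control the step. With dual Masters, for example, we configure the primary to generate odd IDs and the standby to generate even IDs, so that even if a small amount of binlog has not been applied at switchover, no data conflicts result. Across datacenters, say each of two datacenters has a dual-Master pair and the two datacenters must also stay in sync: this used to require third-party scripts or programs. With multi-master, setting the step to 4 gives each datacenter its own dual-Master HA while data is also synchronized between datacenters.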
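
To see why a step of 4 keeps four Masters from colliding, here is a minimal C model of the ID sequence, assuming generated values have the form offset + N * increment. It models the arithmetic only and is not server code; next_auto_id is a name invented for this sketch.

```c
#include <assert.h>

/* Returns the smallest ID greater than `last` of the form
   offset + N * increment (N >= 0). Models auto_increment_increment
   (increment) and auto_increment_offset (offset). */
unsigned long next_auto_id(unsigned long last,
                           unsigned long increment,
                           unsigned long offset)
{
  assert(increment > 0 && offset >= 1 && offset <= increment);
  if (last < offset)
    return offset;
  return offset + ((last - offset) / increment + 1) * increment;
}
```

With increment 4 and offsets 1 to 4, the four Masters generate 1, 5, 9, ...; 2, 6, 10, ...; 3, 7, 11, ...; and 4, 8, 12, ...; so no two of them can produce the same ID.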

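Known defects:
1. I have not yet implemented a reset slave 'channel_id' command, so a replication channel cannot be reset and can only be changed with CHANGE MASTER. It can be done; we simply have no need for it right now and will look at this detail once things are stable.
2. Data conflicts are not detected. This cannot be solved: I simply call the function that starts a Slave to start several replication threads, and binlogs are fetched locally and applied. Conflicts cannot be detected in advance and are only reported when the conflicting event is executed. You can set slave-skip-errors, which applies globally. The other replication settings are global as well.

Latest version of the patch
Defect 1 has been fixed; reset slave now works.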

A Tuning Case Where InnoDB Outperformed Oracle

P.Linux — Sun, 22 Jan 2012 16:00:59 +0000

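Before the New Year holiday I spent some time at a sister company helping test a migration from Oracle to MySQL. The original goal was only to tune MySQL to roughly Oracle's performance, but with guidance and help from @何_登成, @淘宝丁奇, @淘宝褚霸 and @淘伯松 (in no order of merit; listed by when I first bothered each of them about this case), the plan was revised repeatedly and we ended up with better performance than Oracle. It is a special scenario, but I think the lessons apply widely and are worth recording here.
Everything involving table structures and the concrete business model has been left out. Please do not ask about it; it cannot be disclosed. Thank you for understanding.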

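I. Test model:

There are 12 business tables. Each transaction consists of 12 SQL statements, each of which INSERTs into one table; a transaction is complete once all 12 statements have run.

A program written with the MySQL C API connects to MySQL and repeats the following steps:

Begin transaction: START TRANSACTION;
Insert one row into each table: INSERT INTO xxx VALUES (val1,val2,…); # 12 times in total
Commit: COMMIT;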
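
The shape of one test transaction can be sketched as follows. This is not the original test program: the table names t1..t12, the inserted value, and the names exec_fn, run_test_transaction and count_statements are placeholders invented here. Statement execution is passed in as a callback so the sketch stays self-contained; in a real client the callback could wrap mysql_query() from the MySQL C API.

```c
#include <assert.h>
#include <stdio.h>

#define TABLE_COUNT 12

/* Sends one statement to the server; returns 0 on success. */
typedef int (*exec_fn)(void *ctx, const char *sql);

/* Runs one test transaction: START TRANSACTION, one INSERT per table,
   COMMIT. On the first failure it attempts a ROLLBACK and returns the
   callback's error code. Table names and values are placeholders. */
int run_test_transaction(exec_fn exec, void *ctx)
{
  char sql[128];
  int rc;

  assert(exec != NULL);
  rc = exec(ctx, "START TRANSACTION");
  if (rc != 0)
    return rc;
  for (int i = 1; i <= TABLE_COUNT; ++i) {
    snprintf(sql, sizeof sql, "INSERT INTO t%d VALUES (1)", i);
    rc = exec(ctx, sql);
    if (rc != 0) {
      exec(ctx, "ROLLBACK");
      return rc;
    }
  }
  return exec(ctx, "COMMIT");
}

/* Stand-in executor for dry runs: only counts the statements it is given. */
int count_statements(void *ctx, const char *sql)
{
  (void)sql;
  ++*(int *)ctx;
  return 0;
}
```

A dry run with count_statements sees 14 statements per transaction: one START TRANSACTION, twelve INSERTs and one COMMIT.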

A shell script starts 32 copies of the test program to run concurrently.

II. Test environment:

1. Hardware:

R510
CPU: Intel(R) Xeon(R) CPU E5645 @ 2.40GHz, two sockets, 24 threads
Memory: 6 * 8G, 48G total
Storage: FusionIO 320G MLC

R910
CPU: Intel(R) Xeon(R) CPU E7530 @ 1.87GHz, four sockets, 48 threads
Memory: 32 * 4G, 128G total
Storage: FusionIO 640G MLC

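2. Linux configuration:

Single-instance database: add numa=off to the kernel boot parameters in /boot/grub/menu.lst
Multi-instance database: numactl --cpunodebind=$BIND_NO --localalloc $MYSQLD

RHEL 5.4 with the stock 2.6.18 kernel
RHEL 6.1 with the Taobao build of the 2.6.32 kernel

fs.aio-max-nr = 1048576 # maximum number of asynchronous IO requests the system allows
vm.nr_hugepages = 18000 # number of huge pages
vm.hugetlb_shm_group = 601 # group ID allowed to use huge pages, here the mysql user's group
vm.swappiness = 0 # avoid using swap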

3. FusionIO configuration:

Driver options:
/etc/modprobe.d/iomemory-vsl.conf
options iomemory-vsl use_workqueue=0 # bypass the Linux IO scheduler
options iomemory-vsl disable-msi=0 # enable MSI interrupts
options iomemory-vsl use_large_pcie_rx_buffer=1 # enable the large PCIe receive buffer
options iomemory-vsl preallocate_memory=<serial number> # preallocate management memory

Format options:
fio-format -b 4K /dev/fct0 # format the device with 4K sectors to match the NAND page size
mkfs.xfs -f -i attr=2 -l lazy-count=1,sectsize=4096 -b size=4096 -d sectsize=4096 -L data /dev/fioa # align XFS with FusionIO's 4K pages; fairly aggressive, and more stability testing is needed before these parameters can be considered fully safe

Mount options:
/dev/fioa on /data type xfs (rw,noatime,nodiratime,noikeep,nobarrier,allocsize=100M,attr2,largeio,inode64,swalloc) # FusionIO's logical block is 100M, so preallocation is set to 100M

4. MySQL versions and common configuration:

Percona 5.1.60-13.1, stock
Percona 5.1.60-13.1, modified
* Configurable InnoDB AIO queue length (5.5_change_aio_io_limit.patch)
Percona 5.5.19-24.0, stock
* innodb_flush_neighbor_pages=2 merges only truly contiguous dirty pages
* Group Commit
Percona 5.5.18-23.0, modified
* Configurable InnoDB AIO queue length (5.5_change_aio_io_limit.patch)
* Data files can be pre-extended (5.5_innodb_extent_tablespace.patch, contributed by @淘宝丁奇)
* Group Commit

innodb_buffer_pool_size=20G
sync_binlog=1
innodb_flush_log_at_trx_commit=1

Test concurrency: 32

5. Patches

#cat 5.5_change_aio_io_limit.patch

--- Percona-Server-5.5.18-23.0/storage/innobase/handler/ha_innodb.cc	2011-12-20 06:38:58.000000000 +0800
+++ Percona-Server-5.5.18-23.0-debug/storage/innobase/handler/ha_innodb.cc	2012-01-17 10:13:41.000000000 +0800
@@ -146,6 +146,7 @@
 static ulong innobase_commit_concurrency = 0;
 static ulong innobase_read_io_threads;
 static ulong innobase_write_io_threads;
+static ulong innobase_aio_pending_ios_per_thread; // Change AIO io_limit By P.Linux
 static long innobase_buffer_pool_instances = 1;

 static ulong innobase_page_size;
@@ -2870,6 +2871,7 @@
 	srv_n_file_io_threads = (ulint) innobase_file_io_threads;
 	srv_n_read_io_threads = (ulint) innobase_read_io_threads;
 	srv_n_write_io_threads = (ulint) innobase_write_io_threads;
+	srv_n_aio_pending_ios_per_thread = (ulint) innobase_aio_pending_ios_per_thread;

 	srv_read_ahead &= 3;
 	srv_adaptive_flushing_method %= 3;
@@ -12282,6 +12284,11 @@
   "Number of background write I/O threads in InnoDB.",
   NULL, NULL, 4, 1, 64, 0);

+static MYSQL_SYSVAR_ULONG(aio_pending_ios_per_thread, innobase_aio_pending_ios_per_thread,
+  PLUGIN_VAR_RQCMDARG | PLUGIN_VAR_READONLY,
+  "Number of AIO pending IOS per-thread in InnoDB.",
+  NULL, NULL, 4, 32, 4096, 0);
+
 static MYSQL_SYSVAR_LONG(force_recovery, innobase_force_recovery,
   PLUGIN_VAR_RQCMDARG | PLUGIN_VAR_READONLY,
   "Helps to save your data in case the disk image of the database becomes corrupt.",
--- Percona-Server-5.5.18-23.0/storage/innobase/srv/srv0srv.c	2011-12-20 06:38:57.000000000 +0800
+++ Percona-Server-5.5.18-23.0-debug/storage/innobase/srv/srv0srv.c	2012-01-17 10:23:35.000000000 +0800
@@ -242,6 +242,7 @@
 UNIV_INTERN ulint	srv_n_file_io_threads	= ULINT_MAX;
 UNIV_INTERN ulint	srv_n_read_io_threads	= ULINT_MAX;
 UNIV_INTERN ulint	srv_n_write_io_threads	= ULINT_MAX;
+UNIV_INTERN ulint   srv_n_aio_pending_ios_per_thread = ULINT_MAX; // Change AIO io_limit By P.Linux

 /* Switch to enable random read ahead. */
 UNIV_INTERN my_bool	srv_random_read_ahead	= FALSE;
--- Percona-Server-5.5.18-23.0/storage/innobase/srv/srv0start.c	2011-12-20 06:38:57.000000000 +0800
+++ Percona-Server-5.5.18-23.0-debug/storage/innobase/srv/srv0start.c	2012-01-17 10:25:12.000000000 +0800
@@ -1475,14 +1475,16 @@

 	ut_a(srv_n_file_io_threads

#cat 5.5_innodb_extent_tablespace.patch

--- Percona-Server-5.5.18-23.0/sql/sql_yacc.yy	2011-12-20 06:38:58.000000000 +0800
+++ Percona-Server-5.5.18-23.0-debug/sql/sql_yacc.yy	2012-01-17 14:45:47.000000000 +0800
@@ -3878,6 +3878,14 @@
           { 
             Lex->alter_tablespace_info->ts_alter_tablespace_type= ALTER_TABLESPACE_DROP_FILE; 
           }
+        /* innodb_extent_tablespace By P.Linux */
+        | tablespace_name
+          SET
+          opt_ts_extent_size
+          {
+            Lex->alter_tablespace_info->ts_alter_tablespace_type= ALTER_TABLESPACE_ALTER_FILE;
+          }
+        /* End */
         ;

 logfile_group_info:
--- Percona-Server-5.5.18-23.0/sql/handler.h	2011-12-20 06:38:58.000000000 +0800
+++ Percona-Server-5.5.18-23.0-debug/sql/handler.h	2012-01-17 14:29:17.000000000 +0800
@@ -501,7 +501,8 @@
 {
   TS_ALTER_TABLESPACE_TYPE_NOT_DEFINED = -1,
   ALTER_TABLESPACE_ADD_FILE = 1,
-  ALTER_TABLESPACE_DROP_FILE = 2
+  ALTER_TABLESPACE_DROP_FILE = 2,
+  ALTER_TABLESPACE_ALTER_FILE = 3 // innodb_extent_tablespace By P.Linux
 };

 enum tablespace_access_mode
--- Percona-Server-5.5.18-23.0/storage/innobase/fil/fil0fil.c	2011-12-20 06:38:57.000000000 +0800
+++ Percona-Server-5.5.18-23.0-debug/storage/innobase/fil/fil0fil.c	2012-01-17 14:31:40.000000000 +0800
@@ -368,7 +368,8 @@
 Checks if a single-table tablespace for a given table name exists in the
 tablespace memory cache.
 @return	space id, ULINT_UNDEFINED if not found */
-static
+//static
+UNIV_INTERN // innodb_extent_tablespace By P.Linux
 ulint
 fil_get_space_id_for_table(
 /*=======================*/
@@ -4676,7 +4677,8 @@
 Checks if a single-table tablespace for a given table name exists in the
 tablespace memory cache.
 @return	space id, ULINT_UNDEFINED if not found */
-static
+//static
+UNIV_INTERN // innodb_extent_tablespace By P.Linux
 ulint
 fil_get_space_id_for_table(
 /*=======================*/
--- Percona-Server-5.5.18-23.0/storage/innobase/handler/ha_innodb.cc	2011-12-20 06:38:58.000000000 +0800
+++ Percona-Server-5.5.18-23.0-debug/storage/innobase/handler/ha_innodb.cc	2012-01-17 14:37:49.000000000 +0800
@@ -433,6 +434,12 @@
 /*=======================*/
 	uint	flags);

+/****************************************************************//**
+Alter tablespace supported in an InnoDB table. Allow setting extent space. */
+int innobase_alter_tablespace(handlerton *hton,
+                                THD* thd, st_alter_tablespace *alter_info);
+/* innodb_extent_tablespace By P.Linux */
+
 static const char innobase_hton_name[]= "InnoDB";

 /*************************************************************//**
@@ -2489,6 +2496,7 @@
         innobase_hton->flags=HTON_NO_FLAGS;
         innobase_hton->release_temporary_latches=innobase_release_temporary_latches;
 	innobase_hton->alter_table_flags = innobase_alter_table_flags;
+	innobase_hton->alter_tablespace= innobase_alter_tablespace; // innodb_extent_tablespace By P.Linux

 	ut_a(DATA_MYSQL_TRUE_VARCHAR == (ulint)MYSQL_TYPE_VARCHAR);

@@ -3146,6 +3155,33 @@
 		| HA_INPLACE_ADD_PK_INDEX_NO_READ_WRITE);
 }

+/****************************************************************//**
+Alter tablespace supported in an InnoDB table. Allow setting extent space. */
+int innobase_alter_tablespace(handlerton *hton,
+                                THD* thd, st_alter_tablespace *alter_info)
+{
+       if (alter_info->ts_alter_tablespace_type != ALTER_TABLESPACE_ALTER_FILE)
+       {
+               return HA_ADMIN_NOT_IMPLEMENTED;
+       }
+
+       ulint table_space= fil_get_space_id_for_table(alter_info->tablespace_name);
+
+       if (table_space == ULINT_UNDEFINED)
+       {
+               my_error(ER_WRONG_TABLE_NAME, MYF(0), alter_info->tablespace_name);
+               return EE_FILENOTFOUND;
+       }
+
+       ulint extent_size= alter_info->extent_size;
+       
+       ulint actual_size=0;
+       fil_extend_space_to_desired_size(&actual_size, table_space, extent_size);
+
+       return 0;
+}
+/* innodb_extent_tablespace By P.Linux */
+
 /*****************************************************************//**
 Commits a transaction in an InnoDB database. */
 static
--- Percona-Server-5.5.18-23.0/storage/innobase/include/fil0fil.h	2011-12-20 06:38:57.000000000 +0800
+++ Percona-Server-5.5.18-23.0-debug/storage/innobase/include/fil0fil.h	2012-01-17 14:39:20.000000000 +0800
@@ -744,6 +744,18 @@
 /*============================*/
 	ulint		id);	/*!< in: space id */

+/*******************************************************************//**
+Checks if a single-table tablespace for a given table name exists in the
+tablespace memory cache.
+@return        space id, ULINT_UNDEFINED if not found */
+UNIV_INTERN
+ulint
+fil_get_space_id_for_table(
+/*=======================*/
+       const char*     name);  /*!< in: table name in the standard
+                               'databasename/tablename' format */
+/* innodb_extent_tablespace By P.Linux */
+
 /*************************************************************************
 Return local hash table informations. */

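III. Test results:

1. R910, Oracle, single instance

Tester: 童家旺, Alipay
TPS: 2000 steady, 2600 peak (I did not take part in this test and there is no report, so I cannot confirm the details)
My note: this Oracle setup was already tuned; trust me, our Oracle DBAs know what they are doing. From fragments of what the Oracle DBAs told me: Oracle's TPS also dropped somewhat late in the run and was not perfectly stable, and the CPU was fully used by the end, so further tuning would probably bring only small gains.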

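2. R910, MySQL, single instance, Percona 5.1.59 stock

Tester: 帝俊, Alipay
TPS: 1500 peak, not sustainable (details unknown)
Tester's notes:
The current test data shows that MySQL's checkpointing cannot keep up, so it cannot sustain order creation at 1.5K transactions/s with 10MB/s of redo. Under this load the FIO card writes at 160 to 190MB/s with 2 to 2.3K write IOPS, whereas the card itself tests at up to 600MB/s write throughput and 8K+ write IOPS. How to raise system throughput further needs more study.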

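3. R910, MySQL, multiple instances, Percona 5.1.60-13.1 stock

Tester: 彭立勋, B2B
TPS: peak 500*4 (not sustainable), trough 100, mean 450*4
Key settings:
innodb_page_size=4K # match the data page size to FusionIO
innodb_log_block_size=4K # match the log block size to FusionIO
innodb_log_file_size=1G
innodb_log_files_in_group=3
innodb_buffer_pool_size=20G
innodb_max_dirty_pages_pct=75
innodb_flush_method=ALL_O_DIRECT # write all files with O_DIRECT
innodb_read_io_threads=2
innodb_write_io_threads=10
innodb_io_capacity=20000
innodb_extra_rsegments=16
innodb_use_purge_thread=4
innodb_adaptive_flushing_method=3 # use keep_average flushing
innodb_flush_neighbor_pages=0 # do not flush neighbor pages merely to make the IO sequential
Tester's notes:
One MySQL instance is bound to each physical CPU, and the four instances are tested simultaneously. During the test IOPS fluctuated widely, between 4K and 17K. This suggests that the imperfect checkpoint mechanism makes flushing intermittently busy and leaves IO capacity unused in between. Still, multiple instances bring total TPS close to Oracle's mean, which indicates that some constants inside MySQL may be set unreasonably, or that locking is too coarse for a single instance to use the whole machine.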

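4. R910, MySQL, multiple instances, Percona 5.1.60-13.1 modified

Tester: 彭立勋, B2B
TPS: peak 1200*4, trough 0, mean 950*4
Key settings (on top of test 3):
innodb_aio_pending_ios_per_thread=1024
Tester's notes:
Analysis of test 3 showed that InnoDB had put many pages on the flush_list but was not writing them back promptly: the INNODB_BUFFER_POOL_PAGES system table showed many pages with flush_type=2, i.e. on the flush_list.
A code review showed that the AIO queue InnoDB requests is only 256 entries long, defined by the constant OS_AIO_N_PENDING_IOS_PER_THREAD (os0file.h). After turning this constant into an InnoDB parameter and retesting, FusionIO IOPS reached 7K to 18K and IO utilization improved. Overall performance now exceeds Oracle, but there are severe troughs roughly every 10s.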

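5. R510, MySQL, single instance, Percona 5.5.18-23.0 modified

Tester: 彭立勋, B2B
TPS: peak 2800, trough 2300, mean 2500
Key settings (on top of test 3):
innodb_aio_pending_ios_per_thread=512
alter tablespace `trade/xxx` set extent_size=5000000; # pre-extend the data file
Tester's notes:
Analysis of test 4 showed that the main problem to solve is jitter, which may have two causes: the imperfect checkpoint mechanism and data file extension. The checkpoint mechanism cannot be improved for now, so we first address data file extension. Using 淘宝丁奇's method, we add pre-extension of data files to MySQL and, before the test, extend the files to the size the test will fill, so that no extension happens during the test.
In practice this proved very effective: TPS fluctuates between 2300 and 2800, which is acceptable. But once the Buffer Pool fills with dirty pages, InnoDB increases flushing to keep the dirty page count under control, which hurts TPS. In fact IOPS is not exhausted before the dirty pages fill up, but InnoDB computes its flush rate only from its own redo generation and ignores feedback from the operating system.

Watching the CPUs also showed that the 2.6.18 kernel sends all soft interrupts to Core0, which may be another bottleneck. (I forgot to save the state at the time; the output below was captured later while confirming the kernel issue. The post "CPU软中断实践" shows the same thing.)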
03:05:17 PM CPU %user %nice %sys %iowait %irq %soft %steal %idle intr/s
03:05:18 PM all 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 1014.00
03:05:18 PM 0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 100.00 1000.00

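6. R510, MySQL, single instance, Percona 5.5.19-24.0 stock

Tester: 彭立勋, B2B
TPS: peak 3100, trough 2400, mean 2700
Key settings (on top of test 3):
Kernel replaced with the Taobao build of 2.6.32, with IO interrupt load balancing.
innodb_adaptive_flushing_method = 2
innodb_flush_neighbor_pages = cont
Tester's notes:
With the Taobao kernel, every CPU is kept fairly busy (excerpt):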
06:27:26 PM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest   %idle
06:27:27 PM  all   69.80    0.00   18.68    0.51    0.00    0.17    0.00    0.00   10.84
06:27:27 PM    0   74.75    0.00   17.17    0.00    0.00    0.00    0.00    0.00    8.08
06:27:27 PM    1   73.96    0.00   16.67    1.04    0.00    0.00    0.00    0.00    8.33
06:27:27 PM    2   73.20    0.00   17.53    1.03    0.00    0.00    0.00    0.00    8.25
06:27:27 PM    3   71.72    0.00   19.19    1.01    0.00    0.00    0.00    0.00    8.08
06:27:27 PM    4   71.43    0.00   18.37    1.02    0.00    0.00    0.00    0.00    9.18
06:27:27 PM    5   70.71    0.00   19.19    1.01    0.00    0.00    0.00    0.00    9.09

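This is a good sign: the CPUs are fully used, and before the dirty pages fill up TPS stays fairly steadily above 3000. But the old problem remains: once dirty pages fill up, throughput drops, down to 2400 by the end of the test.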

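IV. Conclusions:

MySQL tuning is tightly coupled to the operating system, and good results require tuning the whole stack together. Given how simple the code path is in this workload, InnoDB's coarse lock granularity does not noticeably limit concurrency.
At present the main weaknesses in how MySQL uses fast hardware are hard-coded constants and an imperfect checkpoint mechanism. The path is: checkpoint flushes dirty pages -> InnoDB AIO queue -> operating system IO queue -> storage device, and a problem at any step can reduce performance.
The InnoDB AIO queue can be made configurable with a patch, so that bottleneck is gone.
The operating system IO queue can be addressed with Taobao's kernel patch, which spreads interrupts across all cores.
The biggest remaining problem is the checkpoint's dirty-page flushing mechanism: it relies only on the redo generation rate, while the hardware still has plenty of IO headroom that InnoDB does not know about.
If we could fix one machine model and one operating system, read the hardware state reported by the operating system from inside MySQL, and let MySQL adapt its behavior to it, system resources could be fully used. For example, when IO util% is low, flushing could be increased even before dirty pages reach the threshold. The system might then never reach the dirty page threshold and could keep TPS high throughout, or at least slow the decline.
Jitter affects both Oracle and MySQL: TPS inevitably drops at the moment a data file is extended. 淘宝丁奇's method solves this completely, and Oracle uses a similar approach, preallocating tablespace files.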

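V. Test limitations:

The test cases are incomplete: 5.5 was not tested on the R910 (it already beat Oracle, but the R910 should do better still), 5.5 was not tested with multiple instances, 5.1 was not tested on the 2.6.32 kernel, and the effect of different page sizes on InnoDB was not tested.
There was no stability testing. Stock builds plus multiple instances are a stable configuration; whether the other changes are 100% stable still needs testing.
The R910 tests had no monitoring system, hence no charts, which is unfortunate.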

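VI. Follow-up action

Where InnoDB decides how many dirty pages to flush, add monitoring of the system's diskstat: when system IO util% < 80%, flush an additional (IO_CAPACITY - current system IO count - flush count computed from redo) pages, increasing flushing ahead of time while the system is not busy, in the hope of keeping TPS stable.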
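
The proposed rule can be written down as a small C function. This is a sketch of the formula only, under the assumption that all three quantities are expressed in the same unit (IO operations, or pages, per second); the name extra_flush_pages and its parameters are invented here and do not exist in InnoDB.

```c
#include <assert.h>

#define IO_UTIL_THRESHOLD_PCT 80.0

/* Extra pages to flush while the device has headroom:
   IO_CAPACITY - current system IO - flush count derived from redo,
   and nothing once IO util% reaches the threshold. Never negative. */
unsigned long extra_flush_pages(double io_util_pct,
                                unsigned long io_capacity,
                                unsigned long current_io,
                                unsigned long redo_flush_pages)
{
  assert(io_util_pct >= 0.0);
  if (io_util_pct >= IO_UTIL_THRESHOLD_PCT)
    return 0;
  unsigned long used = current_io + redo_flush_pages;
  return used >= io_capacity ? 0 : io_capacity - used;
}
```

For example, with innodb_io_capacity=20000, 7000 IOs currently in flight and 3000 pages requested by the redo-based calculation, an IO util% of 50 would allow 10000 additional pages.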

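VII. Additional notes

Why do read-heavy applications not need to worry about leaving IO unused? Reads are synchronous IO: a request goes straight to disk, so with enough concurrency you can always saturate IO. Writes are different: for speed, almost every database writes to memory first and then to disk asynchronously. If you insist on a maximum-protection mode, some databases can presumably write to disk synchronously, but most write to memory first and then asynchronously to disk. So if the asynchronous IO path has a design bottleneck, no amount of extra concurrency helps: once memory fills up, the connection threads all block until the asynchronous IO drains. For write-heavy applications, this case is therefore a useful reference.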

Implementing Kill Idle Transaction in the Server Layer

P.Linux — Fri, 23 Dec 2011 12:43:22 +0000

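This article is licensed under a CC license. It may be freely reposted, provided that the original source, the author information and this copyright notice are given as a hyperlink.
URL: http://www.penglixun.com/database/server_kill_idle_transaction.html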

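In the previous post, "How to Kill Idle Transactions", we described how to clean up idle transactions for InnoDB. Prompted by @sleebin9, this feature can be made to work not just for InnoDB but for every transactional engine in MySQL.

How can this be done in the Server layer? The do_command() function in sql/sql_parse.cc is a good place: the connection thread calls do_command() in a loop to read and execute commands, and inside do_command() it calls my_net_set_read_timeout(net, thd->variables.net_wait_timeout) to set the socket read timeout for the connection, so this is where we can hook in.
Main code: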

  /*
    This thread will do a blocking read from the client which
    will be interrupted when the next command is received from
    the client, the connection is closed or "net_wait_timeout"
    number of seconds has passed
  */
  /* Add For Kill Idle Transaction By P.Linux */
  if (thd->active_transaction())
  {
    if (thd->variables.trx_idle_timeout > 0)
    {
      my_net_set_read_timeout(net, thd->variables.trx_idle_timeout);
    } else if (thd->variables.trx_readonly_idle_timeout > 0 && thd->is_readonly_trx)
    {
      my_net_set_read_timeout(net, thd->variables.trx_readonly_idle_timeout);
    } else if (thd->variables.trx_changes_idle_timeout > 0 && !thd->is_readonly_trx)
    {
      my_net_set_read_timeout(net, thd->variables.trx_changes_idle_timeout);
    } else {
      my_net_set_read_timeout(net, thd->variables.net_wait_timeout);
    }
  } else {
    my_net_set_read_timeout(net, thd->variables.net_wait_timeout);
  }
  /* End */

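Is it clear? This is a sleight of hand: this spot was meant to set wait_timeout, but by first checking whether the thread is inside a transaction we can apply an idle-transaction timeout instead.

trx_idle_timeout controls the timeout for all transactions and has the highest priority
trx_changes_idle_timeout controls the timeout for non-read-only transactions
trx_readonly_idle_timeout controls the timeout for read-only transactions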

Result:

root@localhost : (none) 08:39:49> set autocommit = 0 ;
Query OK, 0 rows affected (0.00 sec)

root@localhost : (none) 08:39:56> set trx_idle_timeout = 5;
Query OK, 0 rows affected (0.00 sec)

root@localhost : (none) 08:40:17> use perf 
Database changed
root@localhost : perf 08:40:19> insert into perf (info ) values('11');
Query OK, 1 row affected (0.00 sec)

root@localhost : perf 08:40:26> select * from perf;
ERROR 2006 (HY000): MySQL server has gone away
No connection. Trying to reconnect...
Connection id:    6
Current database: perf

+----+------+
| id | info |
+----+------+
|  7 | aaaa |
|  9 | aaaa |
| 11 | aaaa |
+----+------+
3 rows in set (0.00 sec)

The complete patch is embedded as a file in the original post and can be downloaded from there.

How to Kill Idle Transactions

P.Linux — Mon, 28 Nov 2011 17:01:19 +0000

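This article is licensed under a CC license. It may be freely reposted, provided that the original source, the author information and this copyright notice are given as a hyperlink.
URL: http://www.penglixun.com/database/how_to_kill_idle_trx.html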

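We often run into this situation: a network failure or an application bug means the COMMIT/ROLLBACK statement never reaches the database and the thread is never released, while lock waits pile up and the connection count surges. This happens a lot on test databases and occasionally in production, so I wanted to add a kill-idle-transaction feature to MySQL.

How to implement it? Doing it in the MySQL Server layer involves many uncertainties, so the safest place is the storage engine layer. We use InnoDB/XtraDB almost exclusively, so the change is based on Percona; Oracle's MySQL can be modified the same way.

Requirements:
1. Once a transaction has started, if more than a set time (innodb_idle_trx_timeout) passes after its last statement finished, the connection should be closed.
2. If the transaction is purely read-only, it takes no locks and is harmless, so it need not be closed and can be kept.
Alexey Kopytov of Percona pointed out a possible problem with this idea: "Even though SELECT queries do not place row locks by default (there are exceptions), they can still block undo log records from being purged." However, we do have scenarios where a SELECT must never be killed unless a later INSERT/UPDATE/DELETE has happened, so I made the change according to our workload.
I argued to Percona's Yasufumi Kinoshita and Alexey Kopytov that pure-SELECT transactions should not be killed, but Alexey Kopytov has not yet accepted controlling this with a single parameter. As a general solution I proposed two variables that separately control the idle timeout of read-only transactions and of transactions holding locks, and I am still waiting for Percona's reply. Since that design is still being tested, I am not publishing that change yet; if you know the MySQL source well, the idea above should be enough to see how to split the control into two parameters.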

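With these two requirements in mind, let's design the approach. The most convenient place for this feature is the InnoDB Master Thread: it is scheduled once per second and can check for idle transactions along the way and close them. Because operating on trx->mysql_thd inside a transaction is not safe, it is generally best to work with the thread ID inside InnoDB. Also, within InnoDB only ha_innodb.cc may reference THD, so any per-thread value the Master Thread needs must be computed in ha_innodb.cc and returned to the master thread as an integer or boolean.

First we add a parameter, idle_trx_timeout: how long a transaction may go without a next statement before it times out and is closed.
In storage/innodb_plugin/handler/ha_innodb.cc, under the "/* plugin options */" comment, add the following code to register the idle_trx_timeout variable.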

static MYSQL_SYSVAR_LONG(idle_trx_timeout, srv_idle_trx_timeout,
  PLUGIN_VAR_RQCMDARG,
  "Seconds of Idle-Transaction timeout. If zero this function has no effect; "
  "if non-zero, a transaction idle for idle_trx_timeout seconds will be closed.",
  NULL, NULL, 0, 0, LONG_MAX, 0);

Further down in the code, add the following to innobase_system_variables:

MYSQL_SYSVAR(idle_trx_timeout),

With this variable in place, the Master Thread (storage/innodb_plugin/srv/srv0srv.c) needs to run a check that finds idle transactions. In its loop, after the if (sync_array_print_long_waits(&waiter, &sema) check, add this block:

    if (srv_idle_trx_timeout && trx_sys) {
        trx_t*  trx;
        time_t  now;
rescan_idle:
        now = time(NULL);
        mutex_enter(&kernel_mutex);
        /* Take the first transaction from the current transaction list */
        trx = UT_LIST_GET_FIRST(trx_sys->mysql_trx_list);
        while (trx) { /* check each transaction in turn */
            /* The transaction is still active and its session is idle */
            if (trx->conc_state == TRX_ACTIVE
                && trx->mysql_thd
                && innobase_thd_is_idle(trx->mysql_thd)) {

                /* Start time of the session's last statement */
                ib_int64_t  start_time = innobase_thd_get_start_time(trx->mysql_thd);
                /* Use the thread ID: touching THD directly inside the storage engine is unsafe */
                ulong       thd_id = innobase_thd_get_thread_id(trx->mysql_thd);

                if (trx->last_stmt_start != start_time) {
                    /* The recorded statement start differs from the session's:
                       the transaction is new, so restart its idle clock */
                    trx->idle_start = now;
                    trx->last_stmt_start = start_time;
                } else if (difftime(now, trx->idle_start)
                       > srv_idle_trx_timeout) {
                    /* Idle for longer than the threshold: kill the session */
                    mutex_exit(&kernel_mutex);
                    thd_kill(thd_id);
                    goto rescan_idle;
                }
            }
            trx = UT_LIST_GET_NEXT(mysql_trx_list, trx); /* next transaction */
        }
        mutex_exit(&kernel_mutex);
    }
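
The per-transaction decision inside that loop can be exercised on its own. The following is a simplified stand-alone model, not InnoDB code: idle_state stands in for the two bookkeeping fields kept per transaction, and idle_check is a name invented for this sketch.

```c
#include <assert.h>
#include <time.h>

/* Stand-in for the two per-transaction bookkeeping fields. */
typedef struct {
  time_t    idle_start;       /* when the current idle period began   */
  long long last_stmt_start;  /* start time of the last statement seen */
} idle_state;

/* One scan of an idle session. Returns 1 if the session should be
   killed, 0 otherwise. A changed statement start time means a new
   statement ran since the last scan, so the idle clock restarts. */
int idle_check(idle_state *st, long long stmt_start,
               time_t now, long timeout)
{
  assert(st != NULL);
  if (timeout <= 0)
    return 0;                 /* feature disabled */
  if (st->last_stmt_start != stmt_start) {
    st->idle_start = now;
    st->last_stmt_start = stmt_start;
    return 0;
  }
  return difftime(now, st->idle_start) > (double)timeout;
}
```

Note that the idle clock starts at the first scan that sees the statement, not at the moment the statement finished, so with one scan per second the effective timeout can be up to about a second longer than configured.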

The trx fields used here are new; add them to trx_struct in storage/innodb_plugin/include/trx0trx.h:

struct trx_struct{
...
    time_t      idle_start;
    ib_int64_t  last_stmt_start;
...
}

Several of the functions used here are custom:

ibool      innobase_thd_is_idle(const void* thd);
ib_int64_t innobase_thd_get_start_time(const void* thd);
ulong      innobase_thd_get_thread_id(const void* thd);

These functions are implemented in ha_innodb.cc; their declarations must be added to storage/innodb_plugin/srv/srv0srv.c, below the header includes.

Then define the functions in storage/innodb_plugin/handler/ha_innodb.cc:

extern "C"
ibool
innobase_thd_is_idle(
    const void* thd)    /*!< in: thread handle (THD*) */
{
    return(((const THD*)thd)->command == COM_SLEEP);
}
extern "C"
ib_int64_t
innobase_thd_get_start_time(
    const void* thd)    /*!< in: thread handle (THD*) */
{
    return((ib_int64_t)((const THD*)thd)->start_time);
}
extern "C"
ulong
innobase_thd_get_thread_id(
        const void* thd)
{
    return(thd_get_thread_id((const THD*) thd));
}

Finally, the most important function, thd_kill, which kills the thread. Define it somewhere in sql/sql_class.cc:

void thd_kill(ulong id)
{
    THD *tmp;
    VOID(pthread_mutex_lock(&LOCK_thread_count));
    I_List_iterator<THD> it(threads);
    while ((tmp=it++))
    {
        /* Never kill DAEMON threads or threads that hold no locks */
        if (tmp->command == COM_DAEMON || tmp->is_have_lock_thd == 0)
            continue;
        if (tmp->thread_id == id)
        {
            pthread_mutex_lock(&tmp->LOCK_thd_data);
            break;
        }
    }
    VOID(pthread_mutex_unlock(&LOCK_thread_count));
    if (tmp)
    {
        tmp->awake(THD::KILL_CONNECTION);
        pthread_mutex_unlock(&tmp->LOCK_thd_data);
    }
}

So that the storage engine can call this function, it has to be declared in the plugin API.
Add the following to include/mysql/plugin.h:

void thd_kill(unsigned long id);

How is a thread's is_have_lock_thd value determined? First add the field to THD (sql/sql_class.h):

class THD :public Statement,
           public Open_tables_state
{
....
  uint16    is_have_lock_thd;
....
}

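Then intercept mysql_execute_command, which every SQL statement passes through, and determine whether a locking operation has occurred or a transaction was committed or newly started.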

  switch (lex->sql_command) {
  case SQLCOM_REPLACE:
  case SQLCOM_REPLACE_SELECT:
  case SQLCOM_UPDATE:
  case SQLCOM_UPDATE_MULTI:
  case SQLCOM_DELETE:
  case SQLCOM_DELETE_MULTI:
  case SQLCOM_INSERT:
  case SQLCOM_INSERT_SELECT:
      thd->is_have_lock_thd = 1;
      break;
  case SQLCOM_COMMIT:
  case SQLCOM_ROLLBACK:
  case SQLCOM_XA_START:
  case SQLCOM_XA_END:
  case SQLCOM_XA_PREPARE:
  case SQLCOM_XA_COMMIT:
  case SQLCOM_XA_ROLLBACK:
  case SQLCOM_XA_RECOVER:
      thd->is_have_lock_thd = 0;
      break;
  }
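
How the flag evolves over the life of a session can be traced with a tiny state model. The command classes below are hypothetical groupings of the SQLCOM_* cases in the switch, introduced only for this sketch.

```c
#include <assert.h>

/* Hypothetical command classes standing in for the SQLCOM_* groups:
   CMD_WRITE   = INSERT/UPDATE/DELETE/REPLACE variants,
   CMD_TRX_END = COMMIT/ROLLBACK and the XA commands,
   CMD_OTHER   = everything else, e.g. a plain SELECT. */
typedef enum { CMD_OTHER, CMD_WRITE, CMD_TRX_END } cmd_class;

/* Returns the new value of the "holds locks" flag after one command. */
int update_lock_flag(int flag, cmd_class cmd)
{
  assert(flag == 0 || flag == 1);
  switch (cmd) {
  case CMD_WRITE:
    return 1;      /* a locking statement ran in this transaction */
  case CMD_TRX_END:
    return 0;      /* transaction boundary: reset the flag */
  default:
    return flag;   /* other statements leave the flag unchanged */
  }
}
```

A session that runs SELECT, INSERT, SELECT, COMMIT therefore has the flag at 0, 1, 1, 0 after each statement, so only the window between the INSERT and the COMMIT is eligible for thd_kill.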

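To stay as compatible as possible with Percona's patch, I reuse Percona's operations wherever I can; a few function calls went through too many layers to follow, so I simplified them.
I also have another version of my own: add a last_sql_end_time field to THD, update it when do_command finishes, and then, from the transaction, read the THD's last_sql_end_time to get the idle time. For Oracle's MySQL I still recommend this approach rather than changing the trx_struct structure, which feels riskier.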

A Few Things About MySQL Timeouts

P.Linux — Thu, 24 Nov 2011 05:39:20 +0000

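This article is licensed under a CC license. It may be freely reposted, provided that the original source, the author information and this copyright notice are given as a hyperlink.
URL: http://www.penglixun.com/database/mysql_timeout.html

Having run into some timeout problems recently, I took the chance to go through all the timeout parameters. First, query the database to see which timeouts exist: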

root@localhost : test 12:55:50> show global variables like "%timeout%";
+----------------------------+--------+
| Variable_name              | Value  |
+----------------------------+--------+
| connect_timeout            | 10     |
| delayed_insert_timeout     | 300    |
| innodb_lock_wait_timeout   | 120    |
| innodb_rollback_on_timeout | ON     |
| interactive_timeout        | 172800 |
| net_read_timeout           | 30     |
| net_write_timeout          | 60     |
| slave_net_timeout          | 3600   |
| table_lock_wait_timeout    | 50     | # this variable is no longer used
| wait_timeout               | 172800 |
+----------------------------+--------+

Let's go through them one by one.

connect_timeout

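Manual description:
The number of seconds that the mysqld server waits for a connect packet before responding with Bad handshake. The default value is 10 seconds as of MySQL 5.1.23 and 5 seconds before that.
Increasing the connect_timeout value might help if clients frequently encounter errors of the form Lost connection to MySQL server at 'XXX', system error: errno.
Explanation: the timeout for the handshake while a connection is being established. It only matters during login; once login succeeds this parameter no longer applies. Its main purpose is to keep the connection count from growing too fast when applications reconnect over a poor network. The default is usually fine.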

delayed_insert_timeout

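Manual description:
How many seconds an INSERT DELAYED handler thread should wait for INSERT statements before terminating.
Explanation: a timeout designed for MyISAM INSERT DELAYED: how long the INSERT DELAYED handler thread waits for further INSERT statements before it terminates.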

innodb_lock_wait_timeout

Manual description:
The timeout in seconds an InnoDB transaction may wait for a row lock before giving up. The default value is 50 seconds. A transaction that tries to access a row that is locked by another InnoDB transaction will hang for at most this many seconds before issuing the following error:

ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

When a lock wait timeout occurs, the current statement is not executed. The current transaction is not rolled back. (To have the entire transaction roll back, start the server with the --innodb_rollback_on_timeout option, available as of MySQL 5.1.15. See also Section 13.6.12, "InnoDB Error Handling".)
innodb_lock_wait_timeout applies to InnoDB row locks only. A MySQL table lock does not happen inside InnoDB and this timeout does not apply to waits for table locks.
InnoDB does detect transaction deadlocks in its own lock table immediately and rolls back one transaction. The lock wait timeout value does not apply to such a wait.
For the built-in InnoDB, this variable can be set only at server startup. For InnoDB Plugin, it can be set at startup or changed at runtime, and has both global and session values.
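Explanation: the description is long, but in short this is how long a query waits on a lock before timing out. It is different from a deadlock: as soon as InnoDB detects a deadlock it rolls back the cheaper of the transactions. A lock wait is when, without any deadlock, one transaction holds a lock that another transaction needs; the one rolled back is always the query requesting the lock.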

innodb_rollback_on_timeout

Manual description:
In MySQL 5.1, InnoDB rolls back only the last statement on a transaction timeout by default. If --innodb_rollback_on_timeout is specified, a transaction timeout causes InnoDB to abort and roll back the entire transaction (the same behavior as in MySQL 4.1). This variable was added in MySQL 5.1.15.
Explanation: if this parameter is off or absent, a timeout rolls back only the last query of the transaction; if it is on, a timeout rolls back the whole transaction.

interactive_timeout/wait_timeout

Manual description:
The number of seconds the server waits for activity on an interactive connection before closing it. An interactive client is defined as a client that uses the CLIENT_INTERACTIVE option to mysql_real_connect(). See also
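Explanation: how long a thread may stay in the SLEEP state before it is closed. Each time the thread is used it wakes up into the active state; once the query finishes it goes back to waiting and the timer restarts. The two variables mean the same thing and differ only in which connections they apply to: interactive_timeout applies to interactive clients (those that pass CLIENT_INTERACTIVE), and wait_timeout applies to non-interactive connections.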

net_read_timeout / net_write_timeout

Manual description:
The number of seconds to wait for more data from a connection before aborting the read. Before MySQL 5.1.41, this timeout applies only to TCP/IP connections, not to connections made through Unix socket files, named pipes, or shared memory. When the server is reading from the client, net_read_timeout is the timeout value controlling when to abort. When the server is writing to the client, net_write_timeout is the timeout value controlling when to abort. See also slave_net_timeout.
On Linux, the NO_ALARM build flag affects timeout behavior as indicated in the description of the net_retry_count system variable.
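Explanation: these parameters apply only to TCP/IP connections (before MySQL 5.1.41, according to the manual text above). They are, respectively, the timeout for the server waiting to receive network packets from the client and the timeout for sending network packets to the client. They only matter for threads that are in an active state.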

slave_net_timeout

Manual description:
The number of seconds to wait for more data from the master before the slave considers the connection broken, aborts the read, and tries to reconnect. The first retry occurs immediately after the timeout. The interval between retries is controlled by the MASTER_CONNECT_RETRY option for the CHANGE MASTER TO statement or --master-connect-retry option, and the number of reconnection attempts is limited by the --master-retry-count option. The default is 3600 seconds (one hour).
Explanation: this is the timeout the slave uses to decide whether the master is down. If no response from the master arrives within the configured time, the slave considers the master dead.

Binding multiple MySQL instances to fixed cores on NUMA processors

P.Linux — Fri, 01 Jul 2011 12:18:00 +0000

本文内容遵从CC版权协议, 可以随意转载, 但必须以超链接形式标明文章原始出处和作者信息及版权声明
网址: http://www.penglixun.com/database/mysql_multi_using_numactl.html

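Also posted at: http://www.mysqlops.com/2011/07/01/mysql_multi_using_numactl.html

I won't introduce NUMA here, since there is plenty of material online. What this article covers is how to use numactl in a multi-instance MySQL setup to bind each instance to a specific physical node, avoiding cross-node memory allocation and cross-node access.

As for why to run multiple instances: MySQL does not make efficient use of many processors and large memory, and running multiple instances improves resource utilization considerably. For details, see the multi-instance tests in Percona's white paper: Scaling MySQL With Virident Flash Drives and Multiple Instances of Percona Server.

For numactl usage, see the man page: http://linux.die.net/man/8/numactl

The basic usage is "numactl [option] program-path". For example, to start mysqld with numactl: numactl [option] /usr/local/mysql/bin/mysqld. I once mistakenly thought numactl took a program name; only after trying it myself did I realize it takes the path to the program.

I will cover only a few important options:
--interleave=all starts the program in interleaved allocation mode, meaning it may freely use memory from other nodes. This is rumored to be the most efficient way to turn off NUMA behavior, but that is only a rumor.
--cpunodebind=node binds the program to run on the given node; even if another physical node is idle, it will not be used.
--localalloc strictly allocates memory on the local node only and prevents memory on other nodes from being allocated to the program running on the current node.

The invocation we want for MySQL is: numactl --cpunodebind=node --localalloc mysqld_path
For ease of operations I can't type that by hand at every MySQL start. I still want to manage startup and shutdown through /etc/init.d/mysql and mysqld_multi, so I use a custom startup script.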

First, write the custom startup script as follows:

#!/bin/bash
# bash rather than plain sh: the script uses arrays

# Program Path
NUMACTL=`which numactl`
MYSQLD=/usr/alibaba/mysql/libexec/mysqld
PS=`which ps`
GREP=`which grep`
CUT=`which cut`
WC=`which wc`
EXPR=`which expr`

# Variables
CPU_BIND=(`$NUMACTL --show | $GREP nodebind | $CUT -d: -f2 `)   # CPU binds list
CPU_BIND_NUM=${#CPU_BIND[@]}    # How many CPU binds
# Count running mysqld processes. The whole-word pattern '\<mysqld\>' is an
# assumed reading of the original filter; it keeps mysqld_safe out of the count.
MYSQLD_NUM=`$PS aux | $GREP mysqld | $GREP -v grep | $GREP '\<mysqld\>' | $WC -l`
MYSQLD_NUM=`$EXPR $MYSQLD_NUM + 1`
BIND_NO=`$EXPR $MYSQLD_NUM % $CPU_BIND_NUM ` # Calc Which CPU to Bind

# echo CMD
echo "$NUMACTL --cpunodebind=$BIND_NO --localalloc $MYSQLD" > /tmp/mysqld.$MYSQLD_NUM

# use exec to avoid having an extra shell around.
exec $NUMACTL --cpunodebind=$BIND_NO --localalloc $MYSQLD "$@"

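The approach: count how many mysqld processes already exist, use numactl --show to find how many physical nodes there are, and from those decide which node the current process should go to. For example, with 2 physical nodes and no mysqld process running, the current process is assigned to node 0; start another instance while 1 mysqld process exists and it goes to node 1; the next instance goes to node 0 again, and so on in rotation.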
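
The rotation described here can be sketched in a few lines of Python. This is an illustration of the prose description rather than a port of the shell script: the function names and the sample `numactl --show` text are assumptions, and the script increments the process count before taking the modulo, so its sequence may start one node later than this sketch.

```python
def parse_nodebind(show_output):
    """Extract the node ids from the 'nodebind:' line of `numactl --show` output."""
    for line in show_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "nodebind":
            return [int(n) for n in value.split()]
    return []


def pick_node(running_mysqld, nodes):
    """Round-robin: with k instances already running, use nodes[k % len(nodes)]."""
    return nodes[running_mysqld % len(nodes)]


# Text shaped like `numactl --show` on a two-node machine (illustrative only).
SAMPLE = "policy: default\nphyscpubind: 0 1 2 3\nnodebind: 0 1\nmembind: 0 1\n"

nodes = parse_nodebind(SAMPLE)
print([pick_node(k, nodes) for k in range(4)])
```

Indexing into the parsed node list, rather than using the remainder directly as the node id, also covers machines whose allowed node ids do not start at 0.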

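Then configure my.cnf to start mysqld through our own script:

[mysqld_safe]
......
ledir=/usr/local/mysql/bin/ # directory containing the custom script
mysqld=mysqld_using_numactl # name of the custom script

After that, starting mysqld through /etc/init.d/mysql or mysqld_multi applies the binding.
You can start one instance first and run some CPU-heavy operations in MySQL. You will see activity only on the cores of one physical node; even if every core on that node is at 100% utilization, the cores on the other node stay completely idle.

If you are interested, give it a try.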

如果可以，我们一起留在宜春

P.Linux — Mon, 06 Jun 2011 13:26:50 +0000

本文内容遵从CC版权协议, 可以随意转载, 但必须以超链接形式标明文章原始出处和作者信息及版权声明
网址: http://www.penglixun.com/life/feeling/love_hometown_yichun.html

如果可以，我们一起留在宜春

不要那些所谓的理想不要那些所谓的奋斗

不想去英国、美国，读书留学不想去上海、北京，打拼奋斗

就一起留在我们熟悉的城市

每一条街道都能叫出名字每一个邻居都认识

想打个麻将，唱个歌

一个电话，半个小时

人就聚齐了

如果可以，我们一起留在宜春

无聊了一起去南昌玩一趟

开个车三四个小时就到江西首府了

过年过节几个朋友窜窜门吃顿饭

谁要是不来，打个车几分钟就到他家门口

直接拖出来

如果可以，我们一起留在宜春

嘴馋的时候

满大街吃美味的小吃

或者到鼓楼、到麻辣大王，吃个烧烤

水果出来的季节

到白马农庄摘草莓

都是一箱一箱的买

因为便宜又好吃

如果可以，我们一起留在宜春

周末的时候还能骑车满世界转悠

心血来潮就去明月山

不好就去袁山

找个野山，带着烧烤架

美滋美滋的自助烧烤

如果可以，我们一起留在宜春

冬天的时候一起堆雪人

夏天去秀江游个泳

累了就随便找个KTV、棋牌室、桌游社呼朋唤友

打打麻将，斗斗地主输赢都在这个圈子

每个人我们都熟悉知根知底

如果可以，我们一起留在宜春

等我们工作了

没有那么大的压力

不用天天加班到10点

不用没有节假日

不用周周出差

只要8点上班，5点下班

不想做饭了就找个哥们家蹭顿饭

饭后可以不用洗碗

还可以一起散散步

如果可以，我们一起留在宜春

看着朋友结婚，每个人的婚礼都能参加

等我们有了孩子

我们要让他们也天天在一起玩

让我们成为世交他们也成为世交

礼拜天领着他们去森林公园

他们看植物，我们看他们

让他干爹干妈一大堆

过年压岁钱多的拿不了

让他一出生就学普通话

而不是一出生周围就是不知道哪个地方的方言或者英语

如果可以，我们一起留在宜春

等我们老了可以天天有人陪着

走不动了还可以打麻将

商量着什么时候再去趟宜春中学

什么时候再爬趟明月山

什么时候再去鼓楼吃小吃

什么时候再……

把年轻的事情都再做一遍

如果……如果……如果可以……

Creating and optimizing index structures in MySQL: an approach

P.Linux — Thu, 02 Jun 2011 08:22:26 +0000

本文内容遵从CC版权协议, 可以随意转载, 但必须以超链接形式标明文章原始出处和作者信息及版权声明
网址: http://www.penglixun.com/database/think_about_mysql_create_and_optimize_index.html

原文链接：http://www.mysqlops.com/2011/05/23/mysql%E4%B8%AD%E5%88%9B%E5%BB%BA%E5%8F%8A%E4%BC%98%E5%8C%96%E7%B4%A2%E5%BC%95%E7%BB%84%E7%BB%87%E7%BB%93%E6%9E%84%E7%9A%84%E6%80%9D%E8%B7%AF.html

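[Introduction]
Starting from a data access requirement in a real production environment, this article analyzes how to design the storage structure, how to manipulate the stored data, and how to keep the cost of those operations and the system overhead as low as possible. It also aims to show beginners the reasoning by which the indexes on a table are put together, and to serve as a reference template.

Test case description
The test case comes from the B2C domain: a product order table that stores the orders generated when users select goods. Some fields have been removed for testing purposes, and the data items are not described individually; the meaning of each field is given in the table definition.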

USE `test`;
DROP TABLE IF EXISTS `test`.`goods_order`;
CREATE TABLE `goods_order`(
`order_id`        INT UNSIGNED      NOT NULL             COMMENT '订单单号',
`goods_id`        INT UNSIGNED      NOT NULL DEFAULT '0' COMMENT '商品款号',
`order_type`      TINYINT UNSIGNED  NOT NULL DEFAULT '0' COMMENT '订单类型',
`order_status`    TINYINT UNSIGNED  NOT NULL DEFAULT '0' COMMENT '订单状态',
`color_id`        SMALLINT  UNSIGNED NOT NULL DEFAULT '0' COMMENT '颜色id',
`size_id`         SMALLINT  UNSIGNED NOT NULL DEFAULT '0' COMMENT '尺寸id',
`goods_number`    MEDIUMINT  UNSIGNED NOT NULL DEFAULT '0' COMMENT '数量',
`depot_id`        INT UNSIGNED  NOT NULL DEFAULT '0' COMMENT '仓库id',
`packet_id`       INT UNSIGNED  NOT NULL DEFAULT '0' COMMENT '储位code',
`gmt_create`      TIMESTAMP     NOT NULL DEFAULT '0000-00-00 00:00:00' COMMENT '添加时间',
`gmt_modify`      TIMESTAMP     NOT NULL DEFAULT '0000-00-00 00:00:00' COMMENT '更新时间',
PRIMARY KEY(order_id,`goods_id`)
)ENGINE=InnoDB AUTO_INCREMENT=1 CHARACTER SET 'utf8' COLLATE 'utf8_general_ci';

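The primary key is PRIMARY KEY(order_id,`goods_id`). Why is the column order order_id, `goods_id` rather than `goods_id`, order_id? The reason is simple: goods_id repeats more often in the order table than order_id, so order_id is more selective and reduces the number of index records scanned, which is more efficient. In addition, as the SQL listed below shows, some statements have only order_id in the WHERE clause. That settles it: order_id must lead the composite primary key and `goods_id` must follow.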

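Summary of table design:
To design a table structure for storing data, first establish which data items there are (the "data flow", in industry terms) and the attributes of each, such as data type, value range and length, and data integrity requirements, so that the attribute definitions can be fixed. Once the data items are settled, carry out at least the following three steps of analysis:
- First, determine which data items, or combinations of them, can serve as the unique identifier of a record;
- Second, determine which operations are performed on the records and how frequent each one is. For websites and similar applications, also distinguish front-end from back-end operations, that is, operations by external users from operations by internal users;
- Finally, for the data items that appear in the conditions of those operations, analyze their selectivity, that is, the ratio of distinct values to the total number of records (the closer to 1, the better the selectivity), as well as how the values are distributed;
Given all of the above, and giving data-modifying operations priority over read-only ones, you can create an index structure that meets the requirements and performs well.
Data access design draws on an important body of knowledge: relational database fundamentals and the normal forms of relational theory. On normal forms specifically, I recommend studying up to BCNF; you should be clear about the differences among 1NF, 2NF, 3NF and BCNF, the problems each avoids, and the shortcomings each has. In real work, however, do not try to push every design toward the normal forms. To borrow a Buddhist saying: emptiness is form, and form is emptiness.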
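
Selectivity as defined above (distinct values divided by total rows) is easy to estimate from a sample. The sketch below is a minimal illustration on made-up rows; the sample values and the helper names are assumptions, and on a real table the same ratio would come from COUNT(DISTINCT col) / COUNT(*).

```python
def selectivity(values):
    """Ratio of distinct values to total rows; closer to 1 means more selective."""
    return len(set(values)) / len(values)


def rank_columns(rows, columns):
    """Order candidate index columns from most to least selective."""
    scores = {col: selectivity([row[col] for row in rows]) for col in columns}
    return sorted(columns, key=lambda col: scores[col], reverse=True), scores


# Made-up sample shaped like goods_order (illustrative values only).
rows = [
    {"order_id": 1, "goods_id": 10, "order_status": 0},
    {"order_id": 2, "goods_id": 10, "order_status": 1},
    {"order_id": 3, "goods_id": 11, "order_status": 0},
    {"order_id": 4, "goods_id": 11, "order_status": 1},
    {"order_id": 5, "goods_id": 12, "order_status": 0},
    {"order_id": 6, "goods_id": 12, "order_status": 1},
]

order, scores = rank_columns(rows, ["goods_id", "order_status", "order_id"])
print(order)
print(scores)
```

In this sample order_id scores 1.0, goods_id 0.5 and order_status about 0.33, which matches the reasoning for putting order_id ahead of goods_id in the primary key.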

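Stored procedure for generating test data
Creating indexes depends on the real data stored in the table, so here is a stored procedure that simulates production data as closely as possible. You can also use it in your own test environment to verify the results yourself.
Stored procedure code: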

DELIMITER $$
DROP PROCEDURE IF EXISTS `usp_make_data` $$
CREATE PROCEDURE `usp_make_data`()
BEGIN
DECLARE iv_goods_id INT UNSIGNED DEFAULT 0;
DECLARE iv_depot_id INT UNSIGNED DEFAULT 0;
DECLARE iv_packet_id INT UNSIGNED DEFAULT 0;

SET iv_goods_id=5000;
SET iv_depot_id=10;
SET iv_packet_id=20;

WHILE iv_goods_id>0
DO
START  TRANSACTION;
WHILE iv_depot_id>0
DO
WHILE iv_packet_id>0
DO
INSERT INTO goods_order(order_id,goods_id,order_type,order_status,color_id,size_id,goods_number,depot_id,packet_id,gmt_create,gmt_modify)
VALUES(SUBSTRING(RAND(),3,8),iv_goods_id,SUBSTRING(RAND(),3,1),SUBSTRING(RAND(),5,1)%2,SUBSTRING(RAND(),3,3),SUBSTRING(RAND(),4,3),SUBSTRING(RAND(),5,2),
iv_depot_id,SUBSTRING(RAND(),4,2)*iv_packet_id,DATE_ADD(NOW(),INTERVAL -SUBSTRING(RAND(),2,3) DAY),DATE_ADD(NOW(),INTERVAL -SUBSTRING(RAND(),3,2) DAY)
);
SET iv_packet_id=iv_packet_id-1;
END WHILE;
SET iv_packet_id=20;
SET iv_depot_id=iv_depot_id-1;
END WHILE ;

COMMIT;
SET iv_depot_id=10;
SET iv_goods_id=iv_goods_id-1;
END WHILE ;
END $$
DELIMITER ;

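Business logic
- Unregistered users, and registered users who are not logged in, can still select and buy items; the generated order number is associated with the system default user UID;
- The association between orders and user UIDs, descriptions and similar information is stored in other tables and linked by order number;
- A user's order can be modified at any time before payment and cannot be modified after payment;
- Paid orders are sent automatically to the logistics department for subsequent processing. When processing finishes, the storage location of the items in the order is updated;
- Part of the data is read periodically into the data warehouse analysis system for statistical analysis;
- Individual order lookup, in both the front end and the back end;
- Display of purchase history;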
SQL statements needed to manipulate the data under these business rules
(1). EXPLAIN SELECT * FROM goods_order WHERE `order_id`=40918986;
(2). SELECT * FROM goods_order WHERE `order_id` IN (40918986,40717328,30923040…) ORDER BY gmt_modify DESC;
(3). UPDATE goods_order SET gmt_modify=NOW(),…. WHERE `order_id`=40717328 AND goods_id=4248;
(4). SELECT COUNT(*) FROM goods_order WHERE depot_id=0 ORDER BY gmt_modify DESC LIMIT 0,50;
(5). SELECT * FROM goods_order WHERE depot_id=6 AND packet_id=0 ORDER BY gmt_modify DESC LIMIT 0,50;
(6). SELECT COUNT(*) FROM goods_order WHERE goods_id=4248 AND order_status=0 AND order_type=1
(7). SELECT * FROM goods_order WHERE goods_id=4248 AND order_status=0 AND order_type=1 ORDER BY gmt_modify DESC LIMIT 0,50;
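(8). SELECT * FROM goods_order WHERE gmt_modify>='2011-04-06';
The 8 SQL statements, grouped by who triggers them:
- Triggered by front-end user clicks: (1), (2), (3);
- Triggered by back-end internal user clicks: (1), (2), (3), (4), (5), (6), (7);
- Run automatically and periodically by the back-end system: (4), (5), (6), (7), normally every 15 minutes during working hours, to check for orders that are paid but not yet prepared, orders that are paid but not yet shipped, and so on;
- Run by the statistics system when it periodically exports data: (8), once every 24 hours;
Analyzing the SQL above further, it falls into 2 classes: read operations (SELECT) and modifying operations (UPDATE, DELETE), as follows:
Columns appearing in the WHERE, GROUP BY, ORDER BY and HAVING clauses of the SELECT statements: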
(1). order_id
(2). order_id+gmt_modify
(3). depot_id+gmt_modify
(4). depot_id+packet_id+gmt_modify
(5). goods_id+order_status+order_type
(6). goods_id+order_status+order_type+gmt_modify
(7). gmt_modify
Condition columns appearing in the WHERE clause of the modifying operations:
(8). order_id+ goods_id

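We already have the primary key index PRIMARY KEY(order_id,`goods_id`). Considering that operations on this table are mainly SELECT and INSERT, with UPDATE next in volume, and based on the SQL above, we can tentatively settle on the following indexes: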
ALTER TABLE goods_order
ADD INDEX idx_goodsID_orderType_orderStatus_gmtmodify(goods_id,order_type,order_status,gmt_modify),
ADD INDEX idx_depotID_packetID_gmtmodify(depot_id,packet_id,gmt_modify);

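Summary:
The article has already explained why the composite primary key is ordered order_id, `goods_id`. To add the other desirable properties of columns in a composite primary key: their values do not change once written, and they are short and preferably numeric;
For SQL (8), which reads data by modification date once a day, a full table scan is used, sacrificing its read performance in order to avoid the index maintenance cost incurred each time the modification date is updated;
For SQL (4) and (5), since each reads only the latest 50 records and the data read is almost certainly hot, we accept worse read performance for one of the two statements and create one fewer composite index, reducing the I/O spent maintaining indexed columns;
For SQL (6) and (7), pay particular attention to the column order of the composite index idx_goodsID_orderType_orderStatus_gmtmodify(goods_id,order_type,order_status,gmt_modify):
- goods_id is more selective than order_type and order_status, and gmt_modify appears only in the ORDER BY clause, so goods_id goes at the head of the composite index to improve its selectivity and efficiency and to reduce logical and physical reads.
- order_status has only two values, 0 or 1, while order_type has several; given that and the SQL statements, order_type must sit closer to the head of the index than order_status;
- gmt_modify appears in the ORDER BY clause, so it must be the last column of the composite index;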

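Finally, to recap the steps from requirements to storage design to writing SQL and creating indexes:
- Map out the data flows the business generates and how the data is read;
- Work out the attributes of every data item in those flows;
- Analyze the business metrics and estimate the volume of data to be stored (note: always express capacity in GB);
- Choose the hardware and database architecture likely to support the business;
- List every condition and operation type that may be used to manipulate the data;
- Analyze the selectivity of each condition column;
- Weigh the performance and I/O cost of each SQL statement, deciding which operations carry more weight and which can carry somewhat less;
- Create the index structure;
- Collect feedback from test and production environments and refine the index structure;

Note:
I had intended to run a simulated test script in a test environment modeled on the business, so that readers could see more directly how the database server's throughput and load change under different index layouts with the same SQL operations and frequencies. Unfortunately my only server became unavailable and I had to give up. Comparing the logical and physical I/O that the same SQL needs under different indexes is another approach; I don't analyze it here, but interested readers can try it. I also strongly recommend that beginners read the relevant section of the MySQL manual: 7.2.6. Index Merge Optimization.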

A faster way to DROP TABLE for large tables in MySQL

P.Linux — Thu, 02 Jun 2011 08:07:12 +0000

本文内容遵从CC版权协议, 可以随意转载, 但必须以超链接形式标明文章原始出处和作者信息及版权声明
网址: http://www.penglixun.com/database/mysql_fast_drop_table_use_hard_lin.html

原文地址：http://www.mysqlops.com/2011/05/18/mysql%E5%88%A0%E9%99%A4%E5%A4%A7%E8%A1%A8%E6%9B%B4%E5%BF%AB%E7%9A%84drop-table%E5%8A%9E%E6%B3%95.html

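I have written before about DROP TABLE XXX, especially on large tables:
http://www.mysqlops.com/2011/02/18/mysql-drop-table-%e5%a4%84%e7%90%86%e8%bf%87%e7%a8%8b.html
While DROP TABLE runs, all other operations hang.
This is because InnoDB holds a global exclusive lock (on the table cache) that is released only when DROP TABLE completes.
On the filesystems we commonly use (ext3, ext4, NTFS), deleting a large file (tens or even hundreds of GB) takes some time.
Below is a way to make DROP TABLE fast: however large the table, InnoDB returns quickly and the table is dropped.
How: make clever use of a hard link.

Test: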

root@127.0.0.1 : test 21:38:00> show table status like 'tt' \G
*************************** 1. row ***************************
Name: tt
Engine: InnoDB
Version: 10
Row_format: Compact
Rows: 151789128
Avg_row_length: 72
Data_length: 11011096576
Max_data_length: 0
Index_length: 5206179840
Data_free: 7340032
Auto_increment: NULL
Create_time: 2011-05-18 14:55:08
Update_time: NULL
Check_time: NULL
Collation: utf8_general_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.22 sec)

root@127.0.0.1 : test 21:39:34> drop table tt ;
Query OK, 0 rows affected (25.01 sec)

Dropping an 11 GB table took about 25 seconds (the time varies with hardware).

Now let's drop another, larger table.
Before that, we create a hard link to the table's data file:

root@ # ln stock.ibd stock.id.hdlk
root@ # ls stock.* -l
-rw-rw---- 1 mysql mysql 9196 Apr 14 23:03 stock.frm
-rw-r--r-- 2 mysql mysql 19096666112 Apr 15 09:55 stock.ibd
-rw-r--r-- 2 mysql mysql 19096666112 Apr 15 09:55 stock.id.hdlk

You will see that the link count of stock.ibd has become 2.

Now we go on to drop the table.

root@127.0.0.1 : test 21:44:37> show table status like 'stock' \G
*************************** 1. row ***************************
Name: stock
Engine: InnoDB
Version: 10
Row_format: Compact
Rows: 49916863
Avg_row_length: 356
Data_length: 17799577600
Max_data_length: 0
Index_length: 1025507328
Data_free: 4194304
Auto_increment: NULL
Create_time: 2011-05-18 14:55:08
Update_time: NULL
Check_time: NULL
Collation: utf8_general_ci
Checksum: NULL
Create_options:
Comment:
1 row in set (0.23 sec)

root@127.0.0.1 : test 21:39:34> drop table stock ;
Query OK, 0 rows affected (0.99 sec)

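It finished in under 1 second, so DROP TABLE no longer hangs for that long.
The table is gone but the data file is still there, so the last step is to delete the data file.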

root # ll
total 19096666112
-rw-r--r-- 2 mysql mysql 19096666112 Apr 15 09:55 stock.id.hdlk
root # rm stock.id.hdlk
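DROP TABLE now takes a few extra steps. (If you have a reliable automated job that creates hard links for large tables and removes stale hard-link files, it becomes much less tedious.)
It greatly reduces the time MySQL hangs, and I believe that is worth it.

As for the principle, it relies on how OS hard links work.
When several file names point to the same inode, the inode's reference count N is greater than 1, and deleting any one of the names is fast,
because the underlying data blocks are not freed; only a pointer is removed.
When the inode's reference count is N=1, deleting the file has to release all the data blocks belonging to it, which takes longer.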
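
The link-count behaviour can be observed on a small temporary file. The sketch below is only an illustration of the hard-link mechanics, not of InnoDB itself; the file names mirror the example and the 4096-byte size is arbitrary. It assumes a filesystem that supports hard links.

```python
import os
import tempfile


def demo_hard_link():
    """Create a file, hard-link it, remove the original name, and report
    (link count before removal, link count after, size still reachable)."""
    with tempfile.TemporaryDirectory() as workdir:
        data = os.path.join(workdir, "stock.ibd")
        link = os.path.join(workdir, "stock.ibd.hdlk")
        with open(data, "wb") as handle:
            handle.write(b"x" * 4096)

        os.link(data, link)                 # second name for the same inode
        before = os.stat(data).st_nlink     # two names now point at the inode

        os.remove(data)                     # drops one name; blocks are kept
        after = os.stat(link).st_nlink
        size = os.path.getsize(link)        # data is still reachable via the link
        return before, after, size


print(demo_hard_link())
```

Removing the first name is a metadata-only operation because another name still references the inode; the expensive block release happens only when the last name is removed.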

That's it. Give it a try.

Guide to compiling MySQL yourself, 2.0

P.Linux — Wed, 13 Apr 2011 06:57:42 +0000

本文内容遵从CC版权协议, 可以随意转载, 但必须以超链接形式标明文章原始出处和作者信息及版权声明
网址: http://www.penglixun.com/database/mysql_compile_reference.html

原文：http://www.mysqlops.com/2011/03/06/mysql_compile_reference.html

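Most users install MySQL from RPM packages or binary tarballs. To fit a range of hardware and software platforms, these generic packages are built conservatively and include only the most commonly used and most stable features.
That meets most users' needs, but sometimes we need only part of the feature set (for example, we don't need the Query Cache, yet unless that module is removed at compile time its code still runs to clean up the Query Cache memory pool, and this has triggered bugs), or we have a better-performing commercial compiler (such as ICC), or we have modified the source. In those cases we have to install by compiling.

Here is how to compile and install MySQL from source.

Part 1: choosing build options
Building MySQL involves two sets of options: those of GCC/ICC and those of MySQL. The GCC/ICC options control the compiler's optimizations; the MySQL options control which MySQL feature modules are built and how.

Take the Xeon 5520 as an example. The 55xx series is Intel's Nehalem architecture. To get the most out of it, we ran many tests trying GCC options to see how to obtain higher MySQL performance.

First, check which flags the processor supports: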

processor : 15
vendor_id : GenuineIntel
cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz
stepping : 5
cpu MHz : 2261.088
cache size : 8192 KB
physical id : 0
siblings : 8
core id : 3
cpu cores : 4
apicid : 7
fpu : yes
fpu_exception : yes
cpuid level : 11
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm syscall nx rdtscp lm constant_tsc ida pni monitor ds_cpl vmx est tm2 cx16 xtpr popcnt lahf_lm
bogomips : 4521.98
clflush size : 64
cache_alignment : 64
address sizes : 40 bits physical, 48 bits virtual
power management:

The cpuinfo output shows support for flags such as sse, sse2 and mmx, which correspond to GCC options; the GCC documentation lists all of the optimization options.
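
Checking the flags line by eye is error-prone, so a small parser can help. This is a minimal sketch: the function name, the shortened sample text and the list of wanted flags are assumptions for illustration. On a real Linux machine the text would be read from /proc/cpuinfo.

```python
def cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


# Shortened sample in the same shape as the listing above (illustrative only).
SAMPLE = (
    "model name : Intel(R) Xeon(R) CPU E5520 @ 2.27GHz\n"
    "flags : fpu vme mmx fxsr sse sse2 popcnt lahf_lm\n"
)

wanted = ["mmx", "sse", "sse2", "avx"]
supported = cpu_flags(SAMPLE)
print({flag: flag in supported for flag in wanted})
```

A flag reported as missing is a hint not to pass the corresponding -m option to the compiler for that machine.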

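After experimenting, I arrived at the build options below. After two weeks of stress testing, MySQL 5.1.46sp1 Enterprise built this way performed 15% better than Percona Server 5.1.47, and it is currently running very stably on our development and test databases. The GCC version is 4.1.3 and the system is RHEL 5.4 x64.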

CXX=gcc \
CHOST="x86_64-pc-linux-gnu" \
CFLAGS=" -O3 \
-fomit-frame-pointer \
-pipe \
-march=nocona \
-mfpmath=sse \
-m128bit-long-double \
-mmmx \
-msse \
-msse2 \
-maccumulate-outgoing-args \
-m64 \
-ftree-loop-linear \
-fprefetch-loop-arrays \
-freg-struct-return \
-fgcse-sm \
-fgcse-las \
-frename-registers \
-fforce-addr \
-fivopts \
-ftree-vectorize \
-ftracer \
-frename-registers \
-minline-all-stringops \
-fbranch-target-load-optimize2" \
CXXFLAGS="${CFLAGS}" \
./configure --prefix=/usr/soft/install/mysql-ent-official-5.1.56 \
--with-server-suffix=custom-mysql \
--with-mysqld-user=mysql \
--with-plugins=partition,blackhole,csv,heap,innobase,myisam,myisammrg \
--with-charset=utf8 \
--with-collation=utf8_general_ci \
--with-extra-charsets=gbk,gb2312,utf8,ascii \
--with-big-tables \
--with-fast-mutexes \
--with-zlib-dir=bundled \
--enable-assembler \
--enable-profiling \
--enable-local-infile \
--enable-thread-safe-client \
--with-readline \
--with-pthread \
--with-embedded-server \
--with-client-ldflags=-all-static \
--with-mysqld-ldflags=-all-static \
--without-query-cache \
--without-geometry \
--without-debug \
--without-ndb-debug

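The GCC options mean:
-fomit-frame-pointer
Do not keep the frame pointer in a register for functions that do not need one. This removes the code that saves and restores it and frees an extra register for many functions. All "-O" levels turn it on, but only where the debugger can work without a frame pointer. It is on by default on AMD64 and off by default on x86. Setting it explicitly is recommended.
-pipe
Use pipes rather than temporary files between compilation stages, which speeds up compilation. Recommended.
-march=nocona
Xeon 55xx processors on GCC 4.1.3
-mfpmath=sse
Use the CPU's SSE scalar floating-point instructions.
-m128bit-long-double
Make long double 128 bits. CPUs from the Pentium onward prefer this and it conforms to the x86-64 ABI, but not to the i386 ABI.
-mmmx -msse -msse2
Use the corresponding instruction set extensions and built-in functions.
-maccumulate-outgoing-args
Compute the maximum space needed for outgoing arguments in the function prologue. This is faster on most modern CPUs; the drawback is a noticeably larger binary.
-m64
Generate code for a 64-bit environment only; it will not run in a 32-bit environment. For x86_64 (including EM64T) only.
-ftree-loop-linear
Perform linear loop transformations on trees. This can improve cache performance and enable further loop optimizations.
-fprefetch-loop-arrays
Generate array prefetch instructions. This can speed up programs that work on large arrays, such as databases and other large software. The actual effect depends on the code.
-freg-struct-return
Return struct and union values in registers when they are small enough, which is more efficient for small structures. Values too large for a register are returned in memory. Recommended only on systems compiled entirely with GCC.
-fgcse-sm
Run store motion after global common subexpression elimination, attempting to move stores out of loops.
-fgcse-las
After global common subexpression elimination, eliminate redundant loads that follow a store to the same memory location.
-fforce-addr
Force addresses to be copied into registers before arithmetic is done on them. Since the needed address has often already been loaded into a register, this can improve the code.
-fivopts
Perform induction variable optimizations on trees.
-ftree-vectorize
Perform loop vectorization on trees.
-ftracer
Perform tail duplication to enlarge superblocks. It simplifies the function's control flow and lets other optimizations do a better job.
-frename-registers
Attempt to remove false dependencies in the code. This is most effective on machines with many registers.
-minline-all-stringops
By default GCC inlines string operations only when the destination is known to be aligned to at least a 4-byte boundary. This option enables more inlining and increases binary size, but can improve programs that depend on fast memcpy, strlen and memset operations. For database systems it can significantly improve the performance of memory operations.
-fbranch-target-load-optimize2
Perform branch target register load optimization after prologue/epilogue threading.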

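Part 2: using TC-Malloc for memory management:
Anyone who writes C on Linux has probably felt the performance problems of malloc, and many use memory pools to improve allocation efficiency.
Google's open-source tcmalloc addresses some of malloc's inefficiencies. Under heavy malloc and free traffic, the system's memory curve is noticeably smoother than with the stock Linux malloc, and under high concurrency it improves the program's stability and performance.
Most guides online load the tcmalloc shared library through mysqld_safe, but our MySQL builds are statically linked; would dynamic loading even take effect then? So it is better to link it statically into MySQL.

Building tcmalloc requires building libunwind first: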

wget http://download.savannah.gnu.org/releases/libunwind/libunwind-0.99.tar.gz
tar zxvf libunwind-0.99.tar.gz

CHOST="x86_64-pc-linux-gnu" \
CFLAGS=" -O3 -fPIC \
-fomit-frame-pointer \
-pipe \
-march=nocona \
-mfpmath=sse \
-m128bit-long-double \
-mmmx \
-msse \
-msse2 \
-maccumulate-outgoing-args \
-m64 \
-ftree-loop-linear \
-fprefetch-loop-arrays \
-freg-struct-return \
-fgcse-sm \
-fgcse-las \
-frename-registers \
-fforce-addr \
-fivopts \
-ftree-vectorize \
-ftracer \
-frename-registers \
-minline-all-stringops \
-fbranch-target-load-optimize2" \
CXXFLAGS="${CFLAGS}" \
./configure && make && make install

Then build tcmalloc:

tar zxvf google-perftools-1.7.tar.gz

CHOST="x86_64-pc-linux-gnu" \
CFLAGS=" -O3 \
-fomit-frame-pointer \
-pipe \
-march=nocona \
-mfpmath=sse \
-m128bit-long-double \
-mmmx \
-msse \
-msse2 \
-maccumulate-outgoing-args \
-m64 \
-ftree-loop-linear \
-fprefetch-loop-arrays \
-freg-struct-return \
-fgcse-sm \
-fgcse-las \
-frename-registers \
-fforce-addr \
-fivopts \
-ftree-vectorize \
-ftracer \
-frename-registers \
-minline-all-stringops \
-fbranch-target-load-optimize2" \
CXXFLAGS="${CFLAGS}" \
./configure --disable-cpu-profiler \
--disable-heap-profiler \
--disable-heap-checker \
--disable-debugalloc \
--enable-minimal \
--enable-frame-pointers && make && make install

Remember to add libtcmalloc to the system library path, otherwise the MySQL build will not find it:

echo "/usr/local/lib" > /etc/ld.so.conf.d/usr_local_lib.conf
/sbin/ldconfig

Finally, build MySQL:

CXX=gcc \
CHOST="x86_64-pc-linux-gnu" \
CFLAGS=" -O3 \
-fomit-frame-pointer \
-pipe \
-march=nocona \
-mfpmath=sse \
-m128bit-long-double \
-mmmx \
-msse \
-msse2 \
-maccumulate-outgoing-args \
-m64 \
-ftree-loop-linear \
-fprefetch-loop-arrays \
-freg-struct-return \
-fgcse-sm \
-fgcse-las \
-frename-registers \
-fforce-addr \
-fivopts \
-ftree-vectorize \
-ftracer \
-frename-registers \
-minline-all-stringops \
-felide-constructors \
-fno-exceptions \
-fno-rtti \
-fbranch-target-load-optimize2" \
CXXFLAGS="${CFLAGS}" \
LDFLAGS=" -lrt -lunwind -ltcmalloc_minimal -lstdc++ " \
./configure --prefix=/usr/soft/install/mysql-ent-custom-5.1.49sp1 \
--with-server-suffix=-custom-edition \
--with-mysqld-user=mysql \
--with-plugins=partition,blackhole,csv,heap,innobase,myisam,myisammrg \
--with-charset=utf8 \
--with-collation=utf8_general_ci \
--with-extra-charsets=gbk,gb2312,utf8,ascii \
--with-big-tables \
--with-fast-mutexes \
--with-zlib-dir=bundled \
--enable-assembler \
--enable-profiling \
--enable-local-infile \
--enable-thread-safe-client \
--with-readline \
--with-pthread \
--with-embedded-server \
--with-mysqld-ldflags=-all-static \
--without-query-cache \
--without-geometry \
--without-debug \
--without-ndb-debug
make && make install

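In use, the memory allocation and release curves under high concurrency are smoother than with the native Linux allocator.

Part 3: trying ICC:
ICC is Intel's own multi-platform compiler. In my tests its advantages in floating-point arithmetic, the threading library and math functions were very clear: native SSE2 support plus Intel's own threading and math libraries give excellent performance.
Compiling the same pi-calculation code with ICC and with GCC, ICC was 20% faster; in the database, comparing the same very complex aggregate SQL, ICC was 34% faster.
Below is a complete recipe for building TC-Malloc + ICC + Percona from source.

Step 1: build and install libunwind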

wget http://download.savannah.gnu.org/releases/libunwind/libunwind-0.99.tar.gz
tar zxvf libunwind-0.99.tar.gz

CC=icc \
CXX=icpc \
LD=xild \
AR=xiar \
CFLAGS="-O3 -no-prec-div -ip -fp-model fast=1 -xSSE2 -axSSE2 -fPIC" \
CXXFLAGS="${CFLAGS}" \
CPPFLAGS=" -I/usr/alibaba/icc/include " \
./configure && make && make install

Step 2: build and install tcmalloc

wget http://google-perftools.googlecode.com/files/google-perftools-1.7.tar.gz
tar zxvf google-perftools-1.7.tar.gz

CC=icc \
CXX=icpc \
LD=xild \
AR=xiar \
CFLAGS="-O3 -no-prec-div -ip -fp-model fast=1 -xSSE2 -axSSE2 -fPIC" \
CXXFLAGS="${CFLAGS}" \
CPPFLAGS=" -I/usr/alibaba/icc/include " \
./configure \
--disable-cpu-profiler \
--disable-heap-profiler \
--disable-heap-checker \
--disable-debugalloc \
--enable-minimal \
--enable-frame-pointers && make && make install

echo "/usr/local/lib" > /etc/ld.so.conf.d/usr_local_lib.conf
/sbin/ldconfig

Step 3: build and install Percona

CC=icc \
CXX=icpc \
LD=xild \
AR=xiar \
CFLAGS="-O3 -fno-builtin-malloc -fno-builtin-calloc -fno-builtin-realloc -fno-builtin-free -unroll2 -ip -fp-model fast=1 -restrict -fno-exceptions -fno-rtti -no-prec-div -fno-implicit-templates -static-intel -static-libgcc -static -xSSE2 -axSSE2" \
CXXFLAGS="${CFLAGS}" \
CPPFLAGS=" -I/usr/alibaba/icc/include " \
LDFLAGS=" -L/usr/alibaba/icc/lib/intel64/ -lrt -lunwind -ltcmalloc_minimal -lstdc++ " \
./configure --prefix=/usr/alibaba/install/percona-custom-5.1.55-12.6 \
--with-server-suffix=-alibaba-edition \
--with-mysqld-user=mysql \
--with-plugins=heap,innodb_plugin,myisam,partition \
--with-charset=utf8 \
--with-collation=utf8_general_ci \
--with-extra-charsets=gbk,utf8,ascii \
--with-big-tables \
--with-fast-mutexes \
--with-zlib-dir=bundled \
--with-readline \
--with-pthread \
--enable-assembler \
--enable-profiling \
--enable-local-infile \
--enable-thread-safe-client \
--without-embedded-server \
--with-mysqld-ldflags=-all-static \
--without-query-cache \
--without-geometry \
--without-debug \
--without-ndb-binlog \
--without-ndb-debug

When configure completes, run make && make install

评论：同学会催生“恐聚族” 攀比斗富炫耀成风

P.Linux — Sun, 06 Feb 2011 07:25:57 +0000

本文内容遵从CC版权协议, 可以随意转载, 但必须以超链接形式标明文章原始出处和作者信息及版权声明
网址: http://www.penglixun.com/life/diary/20110206.html

http://news.163.com/11/0206/03/6S67VFQM00011229.html

网易这篇文章真是说到了点子上，回家感同身受。

没有人关心生活怎么样，没有人关心工作的意义，只在乎有多少钱，甚至家长也是这样，这是一种多么病态的社会。

一个评论说到了我的感受：
国内已经完全畸形了。德国人该比中国人富得多吧，可是年轻人都买二手车，汽车排量多数都在1.0-1.4升。教授在这里绝对是富人，可是许多教授开着小破车乐颠乐颠的上班。大学的清洁工大妈在教授面前绝对不会低人一等。倒是校长在任何人面前都得客客气气。
因为人们有生活，有追求，有尊严。而国内唯一的最求就是钱和权。而且这钱和权来的越不正当，越说明“有本事”。知识不能改变命运，勤劳不能致富。发达的最有效最便捷途径就是无耻，无耻无底线。
我忍受不了这样的社会，也没有能力改变它，只能通过自己的拼搏冲出国门，换一个环境。新的一年，祝各位善良的人都能通过自己的努力，成功逃离苦海。

[译]InnoDB官方博客：InnoDB Plugin的性能和可伸缩性

P.Linux — Thu, 27 Jan 2011 07:57:30 +0000

本文内容遵从CC版权协议, 可以随意转载, 但必须以超链接形式标明文章原始出处和作者信息及版权声明
网址: http://www.penglixun.com/database/plug-in-for-performance-and-scalability.html

原文地址：http://blogs.innodb.com/wp/2009/03/plug-in-for-performance-and-scalability/

Why should you care about the latest “early adopter” release of the InnoDB Plugin, version 1.0.3? One word: performance! The release introduces these features:
为什么你应该关注最近的InnoDB Plugin 1.0.3版？一个词：性能！这个版本包括了这些特性

Enhanced concurrency & scalability: the “Google SMP patch” using atomic instructions for mutexing
增强的并发可可伸缩性：”Google 多处理机补丁” 为Mutext锁操作使用原子操作
More efficient memory allocation: ability to use more scalable platform memory allocator
更有效的内存分配：可以使用更多的可扩展内存分配器（例如tcmalloc）
Improved out-of-the-box scalability: unlimited concurrent thread execution by default
改进的即装即用扩展性：默认无限制的线程并发
Dynamic tuning: at run-time, enable or disable insert buffering and adaptive hash indexing
动态优化调整：在运行时，打开或者关闭插入缓存和自适应哈希索引

These new performance features can yield up to twice the throughput or more, depending on your workload, platform and other tuning considerations. In another post, we explore some details about these changes, but first, what do these enhancements mean for performance and scalability?
这些新的性能特新可以提升多大两倍甚至更多的的吞吐量，这依赖于你的负载，平台和其他调整事项。在另一篇文章中，我们会探讨这些改变的一些细节，但首先，我们现探讨这些性能和可扩展性的增强是什么意思，包括哪些内容

In brief, we’ve tested three different workloads (joins, DBT2 OLTP and a modified sysbench) using a memory-resident database. In all cases, the InnoDB Plugin scales significantly better than the built-in InnoDB in MySQL 5.1. And in some cases, the absolute level of performance is dramatically higher too! The charts below illustrate the kinds of performance gains we’ve measured with release 1.0.3 of the InnoDB Plugin. Your mileage may vary, of course. See the InnoDB website for all the details on these tests.
总之，我们已经使用内存驻留数据库（所有数据都载入在内存中）测试了三种不同的工作负载（关联，DBT2 OLTP和修改过的sysbench）。在所有的情况下，InnoDB Plugin的伸缩性明显优于MySQL 5.1内置的InnoDB。在一些场景中，性能提升的水平高的惊人。下面的图说明了InnoDB Plugin 1.0.3的性能提升。你的测试结果可能不同，当然可以在InnoDB网站看到所有测试的细节。

This release of the InnoDB Plugin incorporates a patch made by Ben Handy and Mark Callaghan at Google to improve multi-core scalability by using more efficient synchronization methods (mutexing and rw-locks) to reduce cpu utilization and contention. We’re grateful for this contribution, and you will be too!
这个InnoDB Plugin版本包含了Google的Ben Handy和Mark Callaghan的补丁来提升多处理机扩展性，包括使用了更有效的同步机制（Mutexing和RW-Locks）来减少CPU利用和竞争。我们非常感谢这个补丁的贡献，相信你也是。

Now to our test results …
现在来看我们的测试结果…

Joins: The following chart shows the performance gains in performing joins, comparing the built-in InnoDB in MySQL (in blue) with the InnoDB Plugin 1.0.3 (in red).
关联：下图展示了执行Join操作时的性能提升，内置InnoDB（蓝）和InnoDB Plugin 1.0.3（红）的比较。

As you can see from the blue bars in the above chart, with MySQL 5.1 using the built-in InnoDB, the total number of joins the system can execute declines as the number of concurrent users increases. In contrast, the InnoDB Plugin slightly improves performance even with one user, and maintains performance as the number of users rises. This performance improvement is due in large part to the use of atomics for mutexing in the InnoDB Plugin.
正如你在上面蓝柱上看到的，MySQL 5.1的内置InnoDB，随着并发数的增加系统的执行速度反而下降了。与此相反，InnoDB Plugin随着并发的提升处理速度甚至略有提高，并且随着用户的增长保持着这种性能。这个性能改善很大程度上是因为对Mutexing使用了原子操作。

Transaction Processing (DBT2): The following chart illustrates a scalability improvement using the OLTP read/write DBT2 benchmark, again comparing the performance of the built-in InnoDB in MySQL with the performance of InnoDB Plugin 1.0.3.
事务处理（DBT2）：下入展示了用DBT2测试OLTP读写性能的提升，再次比较了内置InnoDB和InnoDB Plugin 1.0.3的性能。

Here, the InnoDB Plugin scales better than the built-in InnoDB from 16 to 32 users and produces about 12% more throughput with 64 concurrent users, as other bottlenecks are encountered or system capacity is reached. This improvement is likewise due primarily to the changes in mutexing.
这里，InnoDB Plugin伸缩性在16增加到32线程时表现更好，产生比64线程多大约12%的吞吐量。由于其他性能瓶颈或系统容量达到基线。这个提升依然主要依赖于 Mutexing的改变。

Modified Sysbench: This test uses a version of the well-known sysbench workload, modified to include queries based on a secondary index, as suggested by Mark Callaghan of Google.
修改过的sysbench：这个测试使用了著名的sysbench，修改包括基于非主键索引的查询，由Google的 Mark Callaghan建议。

This time, the InnoDB Plugin shows significantly better scalability from 8 to 64 users than the built-in InnoDB in MySQL, yielding as much as 60% more throughput at 64 users. Like the previous examples, this improvement is largely due to the use of atomics for mutexing.
这次，InnoDB Plugin在8~64线程都展示了明显优于内置InnoDB的可伸缩性。在64并发时多大60%的性能提升！像前一个例子，这个提升依然主要靠 Mutexing的原子性。

Modified Sysbench with tcmalloc: This test uses the same modified sysbench workload, but shows the difference between the built-in InnoDB (which uses the internal InnoDB memory allocator) and the InnoDB Plugin when using a more scalable memory allocator, in this case tcmalloc.
使用tcmalloc的修改过的sysbench：这种测试使用相同的sysbench场景，但是不同于内置InnoDB 的是InnoDB Plugin使用了tcmalloc作为内存分配器。

When the new configuration parameter innodb_use_sys_malloc is set to enable use of the memory allocator tcmalloc, the InnoDB Plugin really shines! Transaction throughput continues to scale, and the actual throughput with 64 users has nearly doubled!
当设置innodb_user_sys_malloc变量为tcmalloc作为内存分配器时，InnoDB Plugin依然是亮点！事务吞吐量继续扩展，在64并发时吞吐量提升接近1倍（相对没有tcmalloc的）。

MySQL tips Q&A (1)

P.Linux — Tue, 18 Jan 2011 21:01:02 +0000

本文内容遵从CC版权协议, 可以随意转载, 但必须以超链接形式标明文章原始出处和作者信息及版权声明
网址: http://www.penglixun.com/database/mysql_some_tips_part_1.html

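I've taken some time to sum up what I've learned operating MySQL as a series of short Q&A tips, as case references for readers and as notes for myself:

1. Under ROW-based dual-master replication, how do you apply a large batch of data corrections quickly?
In an A<->B dual-master setup where only one server takes traffic (our usual architecture), what is the fastest way to correct a large amount of data? Commit in batches from a stored procedure? That has many limitations; sometimes one or more SQL statements can't be split into pieces. What then? The binlog is a fine tool for this. With ROW-format binlog, the slave applies events directly through the Handler API without SQL parsing, which is very fast and essentially pure I/O. So we can run the correction SQL directly on the standby; the ROW binlog it produces is shipped to the primary and applied quickly, generally faster than a stored procedure.

2. How do you get the effect of replicate-do-db with ROW-format replication when statements don't carry the database name?
MySQL has the replicate-do-db option, but with ROW-format binlog it only takes effect when the "db.table" form is used; USE has no effect for ROW format. Suppose I have an instance that needs to replicate only a few of the master's databases, the format is ROW, and none of the SQL uses a db prefix. What to do? Dump the needed databases from the master in full and only the schema of the others, load both into the slave, and set slave-skip-errors=all. Binlog events arriving from the master then always find their database and table, so there is no "table not found" error to block replication, while UPDATE/DELETE events that find no data are skipped as errors. This indirectly achieves replicate-do-db.

3. Bulk-loading unordered data into InnoDB is slow. How do you fix it?
Because InnoDB clusters rows on the primary key, loading gets slower and slower when there is no primary key or the key values are not sequential. How do you move data into InnoDB quickly? Borrowing MyISAM works well. First turn off the InnoDB Buffer Pool to free up memory, create a MyISAM table with no indexes, and just insert, with concurrent_insert=2 so that inserts are appended concurrently at the end of the file, which is very fast. When the load finishes, use ALTER TABLE to add the indexes and remember to include ENGINE=InnoDB, which converts the table from MyISAM to InnoDB. This is far faster than inserting unordered data straight into InnoDB.

4. After switching from A<->B->C->D to A<->B, C<->D, Slave_lag keeps growing. How do you avoid it?
This typically happens when a new dual-master cluster is split off from an existing one, for example to move some databases out of the original cluster. Switching from B->C to C<->D too soon tends to produce slave lag between primary and standby that keeps growing. The cause is that SQL generated on the A<->B cluster carries its server_id into the C->D master-slave pair. If you CHANGE MASTER to C<->D before C and D have finished applying the SQL that originated on A and B, those statements bounce back and forth between C and D: neither C nor D recognizes the statement as its own, so neither discards it; each executes it and writes it to its binlog, and Slave_Lag keeps growing.
Avoiding it is simple: once the relevant writes have moved to C, first stop B->C replication, wait a while until D shows no Slave_Lag, and only then CHANGE MASTER to C<->D. By then everything that came from A and B has been applied.

5. A table contains many duplicate rows. What is the fastest way to delete them?
Many DBAs have had to add a unique index on some columns only to find duplicate data that must be cleaned up first. The usual instinct is to keep one row in the table and delete the rest, but SQL written that way is rarely efficient. Instead, change the approach: select one row from each set of duplicates into a temporary table, delete every row in the original table that has duplicates, then insert the temporary table's rows back into the original table. This is a fairly general and efficient method.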
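
The three-step cleanup in tip 5 can be tried end to end with Python's built-in sqlite3 module. SQLite stands in for MySQL here only so the sketch is self-contained; the users table, its columns and the choice of MIN(name) as the surviving row are assumptions for illustration.

```python
import sqlite3


def dedupe(conn):
    """Keep one row per duplicated email using a temporary table."""
    cur = conn.cursor()
    # Step 1: copy one representative of each duplicated key to a temp table.
    cur.execute(
        "CREATE TEMP TABLE dup_keep AS "
        "SELECT email, MIN(name) AS name FROM users "
        "GROUP BY email HAVING COUNT(*) > 1"
    )
    # Step 2: delete every row whose key is duplicated.
    cur.execute("DELETE FROM users WHERE email IN (SELECT email FROM dup_keep)")
    # Step 3: put the representatives back.
    cur.execute("INSERT INTO users (email, name) SELECT email, name FROM dup_keep")
    cur.execute("DROP TABLE dup_keep")
    conn.commit()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("a@x", "A1"), ("a@x", "A2"), ("b@x", "B"),
     ("c@x", "C1"), ("c@x", "C2"), ("c@x", "C3")],
)
dedupe(conn)
rows = sorted(conn.execute("SELECT email, name FROM users"))
print(rows)
```

Rows that were never duplicated are not touched at all, which is why this tends to do less work than deleting "all but one" row per group in place.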

Ways to implement MySQL Multi-Master

P.Linux — Fri, 14 Jan 2011 09:35:01 +0000

本文内容遵从CC版权协议, 可以随意转载, 但必须以超链接形式标明文章原始出处和作者信息及版权声明
网址: http://www.penglixun.com/tech/program/how_to_mysql_multi_master.html

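MySQL Multi-Master Replication has been talked about for a long time, but MySQL has never delivered it. Comments in the MySQL source say Multi-Master will be implemented, and the mi struct is already prepared for it, yet no release has appeared.
Multi-Master -> Slave replication is genuinely useful, though, for example for a central backup server that backs up the data of all masters.

There are a few ways to implement Multi-Master:
1. Modify the MySQL source: change sql_yacc.yy and sql_lex.cc to support a multi-master CHANGE MASTER TO syntax, then change slave.cc to allow several slaves to run, extending the slave I/O and slave SQL threads into a slave_list.
2. Use a tool such as mysqlbinlog to connect to each master remotely, fetch its binlog, and load it into the local slave server.

The first approach is surely more efficient, but it is too risky, and each new MySQL version may require reworking your changes to fit the new source. The official implementation will certainly be of the first kind; the comments in the source show their design. What may be holding them back is how to handle conflicts and other anomalies when replicating from several masters.

To avoid intruding too far into MySQL, I use the second approach: a script or program calls mysqlbinlog with -R to request logs remotely up to --to-last-log. I also modified the mysqlbinlog source slightly so that it counts log switches and prints the number of switches at the end of the output, for example: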

insert into a values (8)
/*!*/;
# at 1070
#110114 16:16:11 server id 3  end_log_pos 1097 	Xid = 36
COMMIT/*!*/;
DELIMITER ;
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
-- Rorate binlog count: 1

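-- Rorate binlog count: 1 is the log switch information: it means the log was switched once (that is, the log file number passed in for the master was not the last one). The script then tails the final end_log_pos to see how far this sync got and writes it to the *.info file.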

My script needs a multi_master.conf file that holds the settings for each master, for example:

#cat multi_master.conf 
[master1]
MASTER_HOST=1.2.3.4
MASTER_USER=plx
MASTER_PASSWORD=plx
MASTER_PORT=3306
MASTER_LOG_NAME=mysql-bin
MASTER_LOG_IDX=000002
MASTER_LOG_POS=521
RELAY_LOG_DIR=/tmp/
RELAY_LOG_NAME=1-relay-bin

[master2]
MASTER_HOST=2.3.4.5
MASTER_USER=plx
MASTER_PASSWORD=plx
MASTER_PORT=3306
MASTER_LOG_NAME=mysql-bin
MASTER_LOG_IDX=000002
MASTER_LOG_POS=581
RELAY_LOG_DIR=/tmp/
RELAY_LOG_NAME=2-relay-bin

[slave]
SLAVE_USER=plx
SLAVE_PASSWORD=plx

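The slave imports into the local server by default, so no host option is provided.
The configuration file defines two Masters, master1 and master2; any section name other than slave will do. The [slave] section defines the user name and password used for the local import.
I will explain the parameters specific to this script; the ones not explained mean the same as in MySQL.
MASTER_LOG_NAME and MASTER_LOG_IDX together form MySQL's Master_log_file. RELAY_LOG_DIR is the directory where the fetched binlog files are stored. RELAY_LOG_NAME is the base name of the relay files; a sequence number is appended as in MySQL, and the script handles that automatically.
After the first run, files such as master1.info are generated to record how far synchronization has progressed, for example: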

MASTER_LOG_POS=1482
NAME=master1
MASTER_USER=plx
RELAY_LOG_NAME=1-relay-bin
MASTER_LOG_IDX=2
MASTER_HOST=1.2.3.4
MASTER_LOG_NAME=mysql-bin
MASTER_PORT=3306
RELAY_LOG_DIR=/tmp/
MASTER_PASSWORD=plx
RELAY_LOG_IDX=3

multi_master.conf is used only when no *.info file can be found.
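
The precedence rule can be sketched like this. The sketch is Python rather than the original Perl, the function names read_key_values and load_masters are invented for the example, and it assumes the two file layouts shown above (an INI-style multi_master.conf and flat KEY=VALUE *.info files).

```python
import configparser
import os


def read_key_values(path):
    """Parse a flat KEY=VALUE file such as master1.info."""
    values = {}
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#") and "=" in line:
                key, value = line.split("=", 1)
                values[key] = value
    return values


def load_masters(conf_path, info_dir="."):
    """Return {master name: settings}; a <name>.info file wins over the conf."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names in upper case
    parser.read(conf_path)
    masters = {}
    for name in parser.sections():
        if name == "slave":  # local import credentials, not a Master
            continue
        settings = dict(parser[name])
        info_path = os.path.join(info_dir, name + ".info")
        if os.path.exists(info_path):
            settings = read_key_values(info_path)
        masters[name] = settings
    return masters
```

A Master whose .info file exists resumes from the recorded position; one without it starts from the position given in multi_master.conf.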

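For now each invocation of multi_master_repl.pl runs only one round, so it can simply be scheduled over and over. I have not yet fully worked out KILL signal handling in the Perl script; that will be solved when I rewrite it in C. Do not use a brute-force kill -9, because the script then no longer knows how far replication got.

Here is the download link. Do not use this in production; it is only a program for validating the idea.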

Note: There is a file embedded within this post, please visit this post to download the file.

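Next I want to reimplement it in C on top of the mysqlbinlog source, writing the fetched log straight to a socket or importing it directly into a remote mysql, to avoid writing an extra file. New ideas are welcome too.

This is the log of one run: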

#./multi_master_repl.pl 
(DEBUG) Enter: get_config()
	Info: begin
	(DEBUG) get_config --> master1
	(DEBUG) get_config --> multi_master.conf --> master1:MASTER_HOST=1.2.3.4
	(DEBUG) get_config --> multi_master.conf --> master1:MASTER_USER=plx
	(DEBUG) get_config --> multi_master.conf --> master1:MASTER_PASSWORD=plx
	(DEBUG) get_config --> multi_master.conf --> master1:MASTER_PORT=3306
	(DEBUG) get_config --> multi_master.conf --> master1:MASTER_LOG_NAME=mysql-bin
	(DEBUG) get_config --> multi_master.conf --> master1:MASTER_LOG_IDX=000002
	(DEBUG) get_config --> multi_master.conf --> master1:MASTER_LOG_POS=521
	(DEBUG) get_config --> multi_master.conf --> master1:RELAY_LOG_DIR=/tmp/
	(DEBUG) get_config --> multi_master.conf --> master1:RELAY_LOG_NAME=1-relay-bin
	(DEBUG) get_config --> Found master1.info, Read it
	(DEBUG) get_config --> master1.info --> master1:MASTER_LOG_POS=1097
	(DEBUG) get_config --> master1.info --> master1:NAME=master1
	(DEBUG) get_config --> master1.info --> master1:MASTER_USER=plx
	(DEBUG) get_config --> master1.info --> master1:RELAY_LOG_NAME=1-relay-bin
	(DEBUG) get_config --> master1.info --> master1:MASTER_LOG_IDX=2
	(DEBUG) get_config --> master1.info --> master1:MASTER_HOST=1.2.3.4
	(DEBUG) get_config --> master1.info --> master1:MASTER_LOG_NAME=mysql-bin
	(DEBUG) get_config --> master1.info --> master1:MASTER_PORT=3306
	(DEBUG) get_config --> master1.info --> master1:RELAY_LOG_DIR=/tmp/
	(DEBUG) get_config --> master1.info --> master1:MASTER_PASSWORD=plx
	(DEBUG) get_config --> master1.info --> master1:RELAY_LOG_IDX=2
	(DEBUG) get_config --> Push[master1] to Master_Info_List
	(DEBUG) get_config --> master2
	(DEBUG) get_config --> multi_master.conf --> master2:MASTER_HOST=2.3.4.5
	(DEBUG) get_config --> multi_master.conf --> master2:MASTER_USER=plx
	(DEBUG) get_config --> multi_master.conf --> master2:MASTER_PASSWORD=plx
	(DEBUG) get_config --> multi_master.conf --> master2:MASTER_PORT=3306
	(DEBUG) get_config --> multi_master.conf --> master2:MASTER_LOG_NAME=mysql-bin
	(DEBUG) get_config --> multi_master.conf --> master2:MASTER_LOG_IDX=000002
	(DEBUG) get_config --> multi_master.conf --> master2:MASTER_LOG_POS=581
	(DEBUG) get_config --> multi_master.conf --> master2:RELAY_LOG_DIR=/tmp/
	(DEBUG) get_config --> multi_master.conf --> master2:RELAY_LOG_NAME=2-relay-bin
	(DEBUG) get_config --> Found master2.info, Read it
	(DEBUG) get_config --> master2.info --> master2:MASTER_LOG_POS=1541
	(DEBUG) get_config --> master2.info --> master2:NAME=master2
	(DEBUG) get_config --> master2.info --> master2:MASTER_USER=plx
	(DEBUG) get_config --> master2.info --> master2:RELAY_LOG_NAME=2-relay-bin
	(DEBUG) get_config --> master2.info --> master2:MASTER_LOG_IDX=2
	(DEBUG) get_config --> master2.info --> master2:MASTER_HOST=2.3.4.5
	(DEBUG) get_config --> master2.info --> master2:MASTER_LOG_NAME=mysql-bin
	(DEBUG) get_config --> master2.info --> master2:MASTER_PORT=3306
	(DEBUG) get_config --> master2.info --> master2:RELAY_LOG_DIR=/tmp/
	(DEBUG) get_config --> master2.info --> master2:MASTER_PASSWORD=plx
	(DEBUG) get_config --> master2.info --> master2:RELAY_LOG_IDX=2
	(DEBUG) get_config --> Push[master2] to Master_Info_List
	(DEBUG) get_config --> multi_master.conf --> slave:SLAVE_USER=plx
	(DEBUG) get_config --> multi_master.conf --> slave:SLAVE_PASSWORD=plx
(DEBUG) Enter: get_config()
	Info: exit
(DEBUG) Enter: create_slave_threads()
	Info: begin
	(DEBUG) create_slave_threads --> Creating run_slave Threads...
(DEBUG) Enter: run_slave()
	Info: begin [tid: 1]
	(DEBUG) run_slave(0) --> NO KILL SIGNAL --> g_is_killed =>0
	(DEBUG) run_slave --> mysqlbinlog: ./mysqlbinlog -h1.2.3.4 -uplx -pplx -R -t --start-position=1097 mysql-bin.000002 > /tmp/1-relay-bin.000002
Warning: ./mysqlbinlog: unknown variable 'loose_default-character-set=utf8'
	(DEBUG) run_slave(0) --> NO KILL SIGNAL --> g_is_killed =>0
(DEBUG) Enter: import_to_slave()
	Info: begin [Param: p_master_idx=>0]
	(DEBUG) import_to_slave(0) --> NO KILL SIGNAL --> g_is_killed =>0
	(DEBUG) import_to_slave(0) --> Importing Relay Log /tmp/1-relay-bin.000002 To Slave...
	(DEBUG) create_slave_threads --> Created 2 run_slave Threads
(DEBUG) Enter: run_slave()
	Info: begin [tid: 2]
	(DEBUG) run_slave(1) --> NO KILL SIGNAL --> g_is_killed =>0
	(DEBUG) run_slave --> mysqlbinlog: ./mysqlbinlog -h2.3.4.5 -uplx -pplx -R -t --start-position=1541 mysql-bin.000002 > /tmp/2-relay-bin.000002
Warning: ./mysqlbinlog: unknown variable 'loose_default-character-set=utf8'
	(DEBUG) run_slave(1) --> NO KILL SIGNAL --> g_is_killed =>0
(DEBUG) Enter: import_to_slave()
	Info: begin [Param: p_master_idx=>1]
	(DEBUG) import_to_slave(1) --> NO KILL SIGNAL --> g_is_killed =>0
	(DEBUG) import_to_slave(1) --> Importing Relay Log /tmp/2-relay-bin.000002 To Slave...
(DEBUG) Enter: update_master_info()
	Info: begin [Param: p_master_idx=>0]
(DEBUG) Enter: update_master_info()
	Info: begin [Param: p_master_idx=>1]
	(DEBUG) update_master_info(0) --> Now Master-Log is mysql-bin.000002 Pos is 1482
(DEBUG) Enter: update_master_info_file()
	Info: begin [Param: p_master_idx=>0]
	(DEBUG) update_master_info_file(0) --> NO KILL SIGNAL --> g_is_killed =>0
	(DEBUG) update_master_info_file(0) --> Writing master1.info --> MASTER_LOG_POS=1482
	(DEBUG) update_master_info_file(0) --> Writing master1.info --> NAME=master1
	(DEBUG) update_master_info_file(0) --> Writing master1.info --> MASTER_USER=plx
	(DEBUG) update_master_info_file(0) --> Writing master1.info --> RELAY_LOG_NAME=1-relay-bin
	(DEBUG) update_master_info_file(0) --> Writing master1.info --> MASTER_LOG_IDX=2
	(DEBUG) update_master_info_file(0) --> Writing master1.info --> MASTER_HOST=1.2.3.4
	(DEBUG) update_master_info_file(0) --> Writing master1.info --> MASTER_LOG_NAME=mysql-bin
	(DEBUG) update_master_info_file(0) --> Writing master1.info --> MASTER_PORT=3306
	(DEBUG) update_master_info_file(0) --> Writing master1.info --> RELAY_LOG_DIR=/tmp/
	(DEBUG) update_master_info_file(0) --> Writing master1.info --> MASTER_PASSWORD=plx
	(DEBUG) update_master_info_file(0) --> Writing master1.info --> RELAY_LOG_IDX=3
	(DEBUG) update_master_info_file(0) --> Created master1.info
(DEBUG) Enter: update_master_info_file(0)
	Info: exit
(DEBUG) Enter: update_master_info(0)
	Info: exit
(DEBUG) Enter: import_to_slave(0)
	Info: exit
(DEBUG) Enter: run_slave(0)
	Info: exit
	(DEBUG) update_master_info(1) --> Now Master-Log is mysql-bin.000002 Pos is 2120
(DEBUG) Enter: update_master_info_file()
	Info: begin [Param: p_master_idx=>1]
	(DEBUG) update_master_info_file(1) --> NO KILL SIGNAL --> g_is_killed =>0
	(DEBUG) update_master_info_file(1) --> Writing master2.info --> MASTER_LOG_POS=2120
	(DEBUG) update_master_info_file(1) --> Writing master2.info --> NAME=master2
	(DEBUG) update_master_info_file(1) --> Writing master2.info --> MASTER_USER=plx
	(DEBUG) update_master_info_file(1) --> Writing master2.info --> RELAY_LOG_NAME=2-relay-bin
	(DEBUG) update_master_info_file(1) --> Writing master2.info --> MASTER_LOG_IDX=2
	(DEBUG) update_master_info_file(1) --> Writing master2.info --> MASTER_HOST=2.3.4.5
	(DEBUG) update_master_info_file(1) --> Writing master2.info --> MASTER_LOG_NAME=mysql-bin
	(DEBUG) update_master_info_file(1) --> Writing master2.info --> MASTER_PORT=3306
	(DEBUG) update_master_info_file(1) --> Writing master2.info --> RELAY_LOG_DIR=/tmp/
	(DEBUG) update_master_info_file(1) --> Writing master2.info --> MASTER_PASSWORD=plx
	(DEBUG) update_master_info_file(1) --> Writing master2.info --> RELAY_LOG_IDX=3
	(DEBUG) update_master_info_file(1) --> Created master2.info
(DEBUG) Enter: update_master_info_file(1)
	Info: exit
(DEBUG) Enter: update_master_info(1)
	Info: exit
(DEBUG) Enter: import_to_slave(1)
	Info: exit
(DEBUG) Enter: run_slave(1)
	Info: exit
(DEBUG) Enter: create_slave_threads()
	Info: exit

Why Multiple Slaves with the Same server_id Conflict in MySQL

P.Linux — Fri, 07 Jan 2011 14:37:54 +0000

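This article is published under a CC license. You may repost it freely, but you must credit the original source, the author, and this copyright notice with a hyperlink.
URL: http://www.penglixun.com/database/mysql_multi_slave_same_serverid.html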

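Today I analyzed a strange problem: a program that emulates a Slave thread kept getting killed by the Master Server. It turned out that two Slaves were connecting to the Master Server with the same server id. Why would the Master Server kill Slaves that share a server id? I read the source, and it comes down to the reconnect mechanism of MySQL Replication.

Let us first look at what happens when a Slave registers with the Master. The Slave first sends the Master a COM_REGISTER_SLAVE command (sql_parse.cc), and the Master calls register_slave to add the Slave to slave_list.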

  case COM_REGISTER_SLAVE:
  {
    if (!register_slave(thd, (uchar*)packet, packet_length))
      my_ok(thd);
    break;
  }

What happens when the Slave thread is registered? Skipping the irrelevant code and going straight to the key part (repl_failsafe.cc):

int register_slave(THD* thd, uchar* packet, uint packet_length)
{
  int res;
  SLAVE_INFO *si;
  uchar *p= packet, *p_end= packet + packet_length;
.... // omitted
  if (!(si->master_id= uint4korr(p)))
    si->master_id= server_id;
  si->thd= thd;
  pthread_mutex_lock(&LOCK_slave_list);
  unregister_slave(thd,0,0); // the key step: first unregister the Slave thread with the same server_id
  res= my_hash_insert(&slave_list, (uchar*) si); // then register the new Slave thread in slave_list
  pthread_mutex_unlock(&LOCK_slave_list);
  return res;
.....
}

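What does this mean? This is the reconnect mechanism. slave_list is a hash table keyed by server_id. Whenever a thread registers, the Slave thread with the same server_id is deleted first, and then the new Slave thread is added to slave_list.

After registering, the thread requests the binlog by sending COM_BINLOG_DUMP, and the Master sends the binlog to the Slave. The code is: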

  case COM_BINLOG_DUMP:
    {
      ulong pos;
      ushort flags;
      uint32 slave_server_id;

      status_var_increment(thd->status_var.com_other);
      thd->enable_slow_log= opt_log_slow_admin_statements;
      if (check_global_access(thd, REPL_SLAVE_ACL))
        break;

      /* TODO: The following has to be changed to an 8 byte integer */
      pos = uint4korr(packet);
      flags = uint2korr(packet + 4);
      thd->server_id=0; /* avoid suicide */
      if ((slave_server_id= uint4korr(packet+6))) // mysqlbinlog.server_id==0
        kill_zombie_dump_threads(slave_server_id);
      thd->server_id = slave_server_id;

      general_log_print(thd, command, "Log: '%s'  Pos: %ld", packet+10,
                      (long) pos);
      mysql_binlog_send(thd, thd->strdup(packet + 10), (my_off_t) pos, flags); // keeps sending the binlog to the slave
      unregister_slave(thd,1,1); // clean up the Slave entry afterwards; reaching this line means the binlog dump thread was killed
      /*  fake COM_QUIT -- if we get here, the thread needs to terminate */
      error = TRUE;
      break;
    }

The mysql_binlog_send function is in sql_repl.cc; it polls the Master binlog and sends it to the Slave.

Now a quick look at what unregister_slave does (repl_failsafe.cc):

void unregister_slave(THD* thd, bool only_mine, bool need_mutex)
{
  if (thd->server_id)
  {
    if (need_mutex)
      pthread_mutex_lock(&LOCK_slave_list);

    SLAVE_INFO* old_si;
    if ((old_si = (SLAVE_INFO*)hash_search(&slave_list,
                                           (uchar*)&thd->server_id, 4)) &&
        (!only_mine || old_si->thd == thd)) // found the slave entry
    hash_delete(&slave_list, (uchar*)old_si); // remove it from slave_list

    if (need_mutex)
      pthread_mutex_unlock(&LOCK_slave_list);
  }
}

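This explains why a Slave with a duplicate server_id gets killed: as soon as a Slave registers, the Slave thread with the same server_id is removed first and the current Slave is added. The reason is that a Slave sometimes loses its connection and reconnects, and then the old thread of course has to be kicked out. That is the thread reconnect mechanism.

Remember: within one MySQL cluster there must never be instances with the same server_id, otherwise all sorts of strange problems will follow.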
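
The effect of a registry keyed by server_id can be shown with a small toy model. This is a Python sketch, not MySQL code: the class Master and its killed list are invented for illustration. In the real server the old dump thread is terminated by the kill_zombie_dump_threads(slave_server_id) call visible in the COM_BINLOG_DUMP branch above, while slave_list only tracks the registration.

```python
class Master:
    """Toy model of slave_list: a hash table keyed by server_id."""

    def __init__(self):
        self.slave_list = {}  # server_id -> connection name
        self.killed = []      # connections kicked out by a newcomer

    def register_slave(self, server_id, conn):
        # Drop any existing entry with the same server_id first ...
        old = self.slave_list.pop(server_id, None)
        if old is not None and old != conn:
            self.killed.append(old)
        # ... then register the new connection under that key.
        self.slave_list[server_id] = conn
```

Two clients that share a server_id keep evicting each other every time one of them reconnects, which matches the repeated kills described at the start of the post.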

Statically Compiling Percona with ICC

P.Linux — Thu, 06 Jan 2011 13:37:47 +0000

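This article is published under a CC license. You may repost it freely, but you must credit the original source, the author, and this copyright notice with a hyperlink.
URL: http://www.penglixun.com/database/icc_static_compile_percona.html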

In my tests ICC has a very clear advantage in floating-point arithmetic, the threading library, and math functions: native SSE2 instruction set support plus Intel's own threading and math libraries give excellent performance.
I compiled the same PI-calculation code with ICC and with GCC and saw a 20% improvement; in the database, comparing the same extremely complex aggregate SQL, ICC was 34% faster.

Step 1: Build and install libunwind
wget http://download.savannah.gnu.org/releases/libunwind/libunwind-0.99.tar.gz
tar zxvf libunwind-0.99.tar.gz
cd libunwind-0.99

CC=icc \
CXX=icpc \
LD=xild \
AR=xiar \
CFLAGS="-O3 -no-prec-div -ip -xSSE2 -axSSE2" \
CXXFLAGS="${CFLAGS}" \
./configure && make && make install

Step 2: Build and install tcmalloc
wget http://google-perftools.googlecode.com/files/google-perftools-1.6.tar.gz
tar zxvf google-perftools-1.6.tar.gz
cd google-perftools-1.6

CC=icc \
CXX=icpc \
LD=xild \
AR=xiar \
CFLAGS="-O3 -no-prec-div -ip -xSSE2 -axSSE2" \
CXXFLAGS="${CFLAGS}" \
./configure --disable-debugalloc --enable-frame-pointers && make && make install

echo "/usr/local/lib" > /etc/ld.so.conf.d/usr_local_lib.conf
/sbin/ldconfig

Step 3: Build and install Percona
CC=icc \
CXX=icpc \
LD=xild \
AR=xiar \
CFLAGS="-O3 -unroll2 -ip -mp -restrict -fno-exceptions -fno-rtti -no-prec-div -fno-implicit-templates -static-intel -static-libgcc -xSSE2 -axSSE2" \
CXXFLAGS="${CFLAGS}" \
CPPFLAGS=" -I/usr/alibaba/icc/include " \
LDFLAGS=" -L/usr/alibaba/icc/lib -lrt " \
./configure --prefix=/usr/alibaba/install/percona-custom-5.1.53-12.4 \
--with-server-suffix=-alibaba-edition \
--with-mysqld-user=mysql \
--with-plugins=heap,innodb_plugin,myisam,partition \
--with-charset=utf8 \
--with-collation=utf8_general_ci \
--with-extra-charsets=gbk,utf8,ascii \
--with-big-tables \
--with-fast-mutexes \
--with-zlib-dir=bundled \
--with-readline \
--with-pthread \
--with-mysqld-ldflags='-all-static -ltcmalloc' \
--enable-assembler \
--enable-profiling \
--enable-local-infile \
--enable-thread-safe-client \
--without-embedded-server \
--with-client-ldflags=-all-static \
--with-mysqld-ldflags=-all-static \
--with-mysqld-ldflags=-ltcmalloc \
--without-query-cache \
--without-geometry \
--without-debug \
--without-ndb-binlog \
--without-ndb-debug
Once configure finishes, run make && make install

PostgreSQL vs. MySQL, Part 1: Table Organization

P.Linux — Mon, 27 Dec 2010 12:36:04 +0000

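This article is published under a CC license. You may repost it freely, but you must credit the original source, the author, and this copyright notice with a hyperlink.
URL: http://www.penglixun.com/database/mysql-vs-postgresql-part-1-table-organization.html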

Translated from: http://blogs.enterprisedb.com/2010/11/29/mysql-vs-postgresql-part-1-table-organization/
Please point out any errors in the translation.

I’m going to be starting an occasional series of blog postings comparing MySQL’s architecture to PostgreSQL’s architecture. Regular readers of this blog will already be aware that I know PostgreSQL far better than MySQL, having last used MySQL a very long time ago when both products were far less mature than they are today. So, my discussion of how PostgreSQL works will be based on first-hand knowledge, but discussion of how MySQL works will be based on research and, insofar as I can make it happen, discussion with people who know it better than I do. (Note: If you’re a person who knows MySQL better than I do and would like to help me avoid making stupid mistakes, drop me an email.)

In writing these posts, I’m going to try to avoid making value judgments about which system is “better”, and instead focus on describing how the architecture differs, and maybe a bit about the advantages of each architecture. I can’t promise that it will be entirely unbiased (after all, I am a PostgreSQL committer, not a MySQL committer!) but I’m going to try to make it as unbiased as I can. Also, bearing in mind what I’ve recently been told by Baron Schwartz and Rob Wultsch, I’m going to focus completely on InnoDB and ignore MyISAM and all other storage engines. Finally, I’m going to focus on architectural differences. People might choose to use PostgreSQL because they hate Oracle, or MySQL because it’s easier to find hosting, or either product because they know it better, and that’s totally legitimate and perhaps worth talking about, but – partly in the interests of harmony among communities that ought to be allies – it’s not what I’m going to talk about here.

So, all that having been said, what I’d like to talk about in this post is the way that MySQL and PostgreSQL store tables and indexes on disk. In PostgreSQL, table data and index data are stored in completely separate structures. When a new row is inserted, or when an existing row is updated, the new row is stored in any convenient place in the table. In the case of an update, we try to store the new row on the same page as the old row if there’s room; if there isn’t room or if it’s an insert, we pick a page that has adequate free space and use that, or failing all else extend the table by one page and add the new row there. Once the table row is added, we cycle through all the indexes defined for the table and add an index entry to each one pointing at the physical position of the table row. One index may happen to be the primary key, but that’s a fairly nominal distinction – all indexes are basically the same.

Under MySQL’s InnoDB, the table data and the primary key index are stored in the same data structure. As I understand it, this is what Oracle calls an index-organized table. Any additional ("secondary") indexes refer to the primary key value of the tuple to which they point, not the physical position, which can change as leaf pages in the primary key index are split. Since this architecture requires every table to have a primary key, an internal row ID field is used as the primary key if no explicit primary key is specified.
Translator's notes: this is InnoDB's clustered primary-key index. It still differs somewhat from Oracle's index-organized table: an index-organized table is ordered entirely by its index, whereas InnoDB orders rows only by the primary key. The internally defined primary key is 6 bytes long.
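
The difference between the two layouts can be modelled with plain dictionaries. This is a toy sketch in Python; the class names, the email column, and the use of dict lookups in place of B-tree traversals are all simplifications chosen for illustration, not how either engine is implemented.

```python
class HeapTable:
    """PostgreSQL-style: rows sit in a heap, every index maps key -> row position."""

    def __init__(self):
        self.heap = []         # physical row storage
        self.pk_index = {}     # id -> position in heap
        self.email_index = {}  # email -> position in heap

    def insert(self, row):
        self.heap.append(row)
        pos = len(self.heap) - 1
        self.pk_index[row["id"]] = pos
        self.email_index[row["email"]] = pos

    def find_by_email(self, email):
        # One index lookup, then a direct fetch by physical position.
        return self.heap[self.email_index[email]]


class IndexOrganizedTable:
    """InnoDB-style: rows live inside the primary-key index itself."""

    def __init__(self):
        self.clustered = {}    # id -> full row
        self.email_index = {}  # email -> primary key value

    def insert(self, row):
        self.clustered[row["id"]] = row
        self.email_index[row["email"]] = row["id"]

    def find_by_email(self, email):
        pk = self.email_index[email]  # first lookup: secondary index
        return self.clustered[pk]     # second lookup: primary-key index
```

In the second class a secondary-index lookup always goes through the primary key, which is the double traversal the article describes for InnoDB.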

Since Oracle supports both options, they are probably both useful. An index-organized table seems particularly likely to be useful when most lookups are by primary key, and most of the data in each row is part of the primary key anyway, either because the primary key columns are long compared with the remaining columns, or because the rows, overall, are short. Storing the whole row in the index avoids storing the same data twice (once in the index and once in the table), and the gain will be larger when the primary key is a substantial percentage of the total data. Furthermore, in this situation, the index page still holds as many, or almost as many, keys as it would if only a pointer were stored in lieu of the whole row, so one fewer random I/Os will be needed to access a given row.
Translator's notes: the two options are index-organized tables and heap tables. The gain refers to the data that would otherwise be stored twice, once in the index and once in the table. The saved random I/O corresponds to a covering index scan.

When accessing an index-organized table via a secondary index, it may be necessary to traverse both the B-tree in the secondary-index, and the B-tree in the primary index. As a result, queries involving secondary indexes might be slower. However, since MySQL has index-only scans ( PostgreSQL does not ), it can sometimes avoid traversing the secondary index. So in MySQL, adding additional columns to an index might very well make it run faster, if it causes the index to function as a covering index for the query being executed. But in PostgreSQL, we frequently find ourselves telling users to pare down the columns in the index to the minimum set that is absolutely necessary, often resulting in dramatic performance gains. This is an interesting example of how the tuning that is right for one database may be completely wrong for another database.
Translator's notes: an index-only scan returns the data from the index alone. The advice about adding columns is not entirely right: with many indexes, index page splits still cause a fair amount of physical I/O, so I recommend keeping indexes to the minimum that meets your needs unless you can be sure the covering index is used often.

I’ve recently learned that neither InnoDB nor PostgreSQL supports traversing an index in physical order, only in key order. For InnoDB, this means that ALL scans are performed in key order, since the table itself is, in essence, also an index. As I understand it, this can make a large sequential scan quite slow, by defeating the operating system’s prefetch logic. In PostgreSQL, however, because tables are not index-organized, sequential scans are always performed in physical order, and don’t require looking at the indexes at all; this also means we can skip any I/O or CPU cost associated with examining non-leaf index pages. Traversing in physical order is apparently difficult from a locking perspective, although it must be possible, because Oracle supports it. It would be very useful to see this support in MySQL, and once PostgreSQL has index-only scans, it would be a useful improvement for PostgreSQL, too.
Translator's notes: a full-table read in InnoDB returns rows in primary-key order. The claim that large sequential scans become much slower is rather far-fetched: with the data at rest, PostgreSQL also has to follow a block pointer to reach the next block, just as InnoDB follows a page pointer to reach the next page. As for skipping the non-leaf index pages, the author seems to have forgotten what a B+ tree is.

One final difficulty with an index-organized table is that you can’t add, drop, or change the primary key definition without a full-table rewrite. In PostgreSQL, on the other hand, this can be done, even while allowing concurrent read and write activity. This is a fairly nominal advantage for most use cases since the primary key of a table rarely changes (I think it’s happened to me only once or twice in the last ten years), but it is useful when it does come up.

I hope that the above is a fair and accurate summary of the topic, but I’m sure I’ve missed a few things and covered others incompletely or in less detail than might be helpful. Please feel free to respond with a comment below or a blog post of your own if I’ve missed something.

The Scheduling Flow of InnoDB's Master Thread

P.Linux — Tue, 14 Dec 2010 17:22:03 +0000

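This article is published under a CC license. You may repost it freely, but you must credit the original source, the author, and this copyright notice with a hyperlink.
URL: http://www.penglixun.com/database/innodb_master_thread.html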

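InnoDB's main IO operations are all carried out in the Master Thread (srv0srv.c), so any analysis of InnoDB's IO scheduling has to analyze the Master Thread.

Below is a flowchart I drew that shows the whole scheduling flow of the Master Thread. The parts in red are the improvements InnoDB Plugin/XtraDB made over the original InnoDB engine.
The parentheses at the bottom of each Process box give the function that performs the operation, so the chart can be read alongside the source code.

A quick explanation of the Insert Buffer: to keep index maintenance from costing too much performance when data is modified, InnoDB buffers index updates with a mechanism called the Insert Buffer. Changes to indexes that are neither the clustered (primary key) index nor unique indexes are not inserted into the index page every time. InnoDB first checks whether the page to be updated is in memory; if it is not, the change is stored in the Insert Buffer and later merged into the leaf nodes of the non-unique index pages according to the Master Thread's scheduling rules, which often lowers the cost of updating indexes. Why must the index be non-unique (excluding the primary key index and unique indexes)? Because a unique index has to check whether the record already exists, the index page affected by the change must be read in to decide uniqueness. At that point the Insert Buffer is pointless, since the page has to be read anyway, so it only works for non-unique indexes.
The "INSERT BUFFER AND ADAPTIVE HASH INDEX" section of show innodb status shows how effective the Insert Buffer is.

A correction: I found that after flushing 100 dirty pages, InnoDB considers the flush to have already taken more than one second, so there is no need to wait. It sets skip_sleep=TRUE, skips os_pthread_sleep, and goes straight to the next check.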
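
The decision logic described above can be sketched as a toy model. This is Python written for illustration only: the class SecondaryIndex, its page numbering, and the disk_reads counter are assumptions, and real InnoDB pages, the buffer pool, and the merge scheduling are far more involved.

```python
class SecondaryIndex:
    """Toy model of insert buffering for one secondary index."""

    def __init__(self, unique):
        self.unique = unique
        self.disk_pages = {}     # page_no -> set of keys stored "on disk"
        self.buffer_pool = {}    # pages currently cached in memory
        self.insert_buffer = []  # pending (page_no, key) changes
        self.disk_reads = 0

    def _load(self, page_no):
        # Bring a page into memory, counting the simulated disk read.
        if page_no not in self.buffer_pool:
            self.disk_reads += 1
            self.buffer_pool[page_no] = self.disk_pages.setdefault(page_no, set())
        return self.buffer_pool[page_no]

    def insert(self, page_no, key):
        if self.unique:
            page = self._load(page_no)  # must read the page to check uniqueness
            if key in page:
                raise ValueError("duplicate key")
            page.add(key)
        elif page_no in self.buffer_pool:
            self.buffer_pool[page_no].add(key)  # page is cached: apply directly
        else:
            self.insert_buffer.append((page_no, key))  # defer, no disk read

    def merge(self):
        """Apply the buffered changes, as the master thread does periodically."""
        pending, self.insert_buffer = self.insert_buffer, []
        for page_no, key in pending:
            self._load(page_no).add(key)
```

For a non-unique index the inserts cost no reads until merge() runs, and several changes to the same page share a single read; a unique index pays the read at insert time.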

The Risk of Running Slave Commands While the Slave SQL Thread Is Blocked

P.Linux — Sun, 12 Dec 2010 08:39:53 +0000

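This article is published under a CC license. You may repost it freely, but you must credit the original source, the author, and this copyright notice with a hyperlink.
URL: http://www.penglixun.com/database/slave_sql_locked_bug.html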

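Today, while adding primary keys on a batch of standby servers, I discovered by accident that if some thread blocks the Slave SQL thread from applying the log, leaving Slave SQL in the Locked state, then trying to run the Slave Stop command inevitably makes statements such as show slave status/master status hang.
The only fix is to wait for the thread that is locking Slave SQL to finish, or to restart the database; I have not found any other way. I reproduced this on MySQL 5.0.68 and 5.1.30/34/40.
I searched the bug database and did find this bug, http://bugs.mysql.com/bug.php?id=56676; the problem exists at least in every version before 5.1.50.

Reading the source, the problem is mainly caused by two locks, mi->run_lock and LOCK_active_mi.
The slave works like this: start_slave_thread creates the handler_slave_sql thread to poll the log; handler_slave_sql calls exec_relay_log_event to apply log events; exec_relay_log_event in turn calls apply_event_and_update_pos, which reads one log event, applies it to the storage engine, and updates the relay-log position; finally, depending on the type of event read, the overloaded XXX_log_event::do_apply_event of the matching class is called to actually apply the decoded event.

The hang happens as follows:
Once slave_sql starts successfully it holds the mi->run_lock lock. mi is an instance of Master_info; it records the master's information, that is, the content of master.info. Holding mi->run_lock means the Slave of that mi is running. (mi is defined as Master_info *, and the comments also say that once Multi Master is finished mi will be an array so that each Master can hold its own lock, so MySQL is working on this too.) Because only a single Master is supported at present, the lock on mi is global, namely LOCK_active_mi. When a statement is Locked, Slave SQL holds mi->run_lock while its cond_wait never sees the condition that would let it continue, so it never reaches the statement if (!sql_slave_killed(thd,rli)). The kill issued by stop_slave therefore goes unnoticed, and slave stop hangs. Since stop slave holds LOCK_active_mi (stopping the Slave requires saving master.info), and show slave status/show status both begin with pthread_mutex_lock(&LOCK_active_mi);, they all block.
There is one more potential risk. The tables_to_lock list in the Relay_log_info class holds the tables the Slave needs to lock. If the Slave cannot proceed in time, tables_to_lock is not cleaned up in time, which brings many locking problems and may cause widespread blocking. We once had an incident in which MySQL hung, most likely for this reason: one of our scripts that skips replication errors ran show slave status and slave start/stop at a high frequency; during a sudden master/standby switchover a large number of connections had to be established, CPU context switches increased, and LOCK_active_mi could not be released fast enough; other scripts that collect monitoring data via show slave status quickly blocked; the tables_to_lock list could not be released in time, which then blocked normal SQL execution behind locks; and because the volume of changes was very large, the blocking spread rapidly until lock waits nearly hung the database.

So my advice to everyone: while a long-running or Locked SQL statement is executing on the Slave, do not run any slave-related command such as show slave/master status or slave stop; show processlist; is the only exception.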
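
The blocking chain can be reproduced in miniature with ordinary threads. This is a simplified Python model, not MySQL code: the class ReplicationState and the events in it are invented for the example, and only the name LOCK_active_mi is taken from the post.

```python
import threading


class ReplicationState:
    def __init__(self):
        self.LOCK_active_mi = threading.Lock()   # global lock named in the post
        self.table_unlocked = threading.Event()  # set when the blocking session ends
        self.kill_requested = threading.Event()
        self.sql_stopped = threading.Event()


def slave_sql_thread(st):
    # Stuck in the "Locked" state: the kill flag is only checked afterwards.
    st.table_unlocked.wait()
    if st.kill_requested.is_set():
        st.sql_stopped.set()


def stop_slave(st):
    # Takes the global lock, asks the SQL thread to stop, then waits for it.
    with st.LOCK_active_mi:
        st.kill_requested.set()
        st.sql_stopped.wait()


def show_slave_status(st, timeout):
    # Needs the same global lock; returns False if it cannot get it in time.
    if st.LOCK_active_mi.acquire(timeout=timeout):
        st.LOCK_active_mi.release()
        return True
    return False


def demo():
    st = ReplicationState()
    sql = threading.Thread(target=slave_sql_thread, args=(st,))
    stopper = threading.Thread(target=stop_slave, args=(st,))
    sql.start()
    stopper.start()
    st.kill_requested.wait()  # STOP SLAVE now holds the global lock
    while_blocked = show_slave_status(st, timeout=0.2)
    st.table_unlocked.set()   # the blocking session finally finishes
    sql.join()
    stopper.join()
    afterwards = show_slave_status(st, timeout=1.0)
    return while_blocked, afterwards
```

In this model the status call fails for as long as the SQL thread stays blocked and succeeds once the blocker goes away, which mirrors the only remedy the post found: wait for the blocking thread to finish.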

The handler_slave_sql loop:
  while (!sql_slave_killed(thd,rli))
  {
    thd_proc_info(thd, "Reading event from the relay log");
    DBUG_ASSERT(rli->sql_thd == thd);
    THD_CHECK_SENTRY(thd);

    if (saved_skip && rli->slave_skip_counter == 0)
    {
      /* ... omitted ... */
    }

    if (exec_relay_log_event(thd,rli))
    {
      DBUG_PRINT("info", ("exec_relay_log_event() failed"));
      // do not scare the user if SQL thread was simply killed or stopped
      if (!sql_slave_killed(thd,rli))
      {
        /* ... omitted ... */
      }
      goto err;
    }
  }

The show slave status command:
static int show_slave_running(THD *thd, SHOW_VAR *var, char *buff)
{
  var->type= SHOW_MY_BOOL;
  pthread_mutex_lock(&LOCK_active_mi);
  var->value= buff;
  *((my_bool *)buff)= (my_bool) (active_mi &&
                                 active_mi->slave_running == MYSQL_SLAVE_RUN_CONNECT &&
                                 active_mi->rli.slave_running);
  pthread_mutex_unlock(&LOCK_active_mi);
  return 0;
}

clear_tables_to_lock, which clears the list of tables to lock:
void Relay_log_info::clear_tables_to_lock()
{
  while (tables_to_lock)
  {
    uchar* to_free= reinterpret_cast<uchar*>(tables_to_lock);
    if (tables_to_lock->m_tabledef_valid)
    {
      tables_to_lock->m_tabledef.table_def::~table_def();
      tables_to_lock->m_tabledef_valid= FALSE;
    }
    tables_to_lock=
      static_cast<RPL_TABLE_LIST*>(tables_to_lock->next_global);
    tables_to_lock_count--;
    my_free(to_free, MYF(MY_WME));
  }
  DBUG_ASSERT(tables_to_lock == NULL && tables_to_lock_count == 0);
}

Percona's Improvements over Standard MySQL

P.Linux — Mon, 06 Dec 2010 08:08:41 +0000

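This article is published under a CC license. You may repost it freely, but you must credit the original source, the author, and this copyright notice with a hyperlink.
URL: http://www.penglixun.com/database/percona_vs_mysql.html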

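Over the weekend I had time to read up on the improvements Percona XtraDB makes to MySQL InnoDB; here is a summary to share.

I. Scalability improvements
1. Better Buffer Pool scalability
A well-known problem of the InnoDB Buffer Pool is contention when many queries run concurrently. XtraDB splits the Buffer Pool's global mutex into several mutexes to reduce contention.

2. Better InnoDB IO scalability
XtraDB adds many variables for tuning IO to its best state, including parameters for checkpointing, the number of background threads that read and write data files, and so on.

3. Multiple rollback segments
To provide consistent reads, InnoDB writes the data modified by transactions to a rollback segment. The rollback segment is protected by a single mutex, which directly limits concurrency for write-intensive workloads. In XtraDB the number of rollback segments can be changed (innodb_extra_rsegments), which can greatly improve performance for write-intensive operations.

4. Higher concurrency
InnoDB provides only 1024 slots in the rollback segment (my colleague Chun Ge has hit this bottleneck). When the slots run out, new transactions cannot start until a slot is released.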

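II. Performance improvements
1. Dedicated Purge thread
In InnoDB the data modified by a transaction is written to the undo space in the shared tablespace, which is how InnoDB provides consistent reads. When a transaction ends, the corresponding area of the undo space is released. But if there are many transactions and the Purge thread cannot clean up fast enough, the shared tablespace grows sharply (this is probably why the BRMMS shared tablespace is so large). That severely degrades performance and may even use up all the disk space. XtraDB uses a dedicated thread to clean up the undo space, which speeds up the cleanup considerably. Although this may lower overall performance, it greatly improves stability, so a slight drop in overall performance is worth it.

2. Configurable Doublewrite buffer
InnoDB uses the doublewrite feature to guard against data corruption: before data is written to the data files, it is first written sequentially to the shared tablespace. If a write turns out to be corrupted, InnoDB uses this buffer to recover the data. Although the data is written twice, the performance impact is usually small; in some high-load environments, however, doublewrite becomes a bottleneck. XtraDB provides an option to place the doublewrite buffer on a separate disk to improve concurrent performance.

3. Query Cache enhancements
Percona provides extra parameters for configuring the Query Cache, for example ignoring comments in the SQL when checking for a cache hit.

4. Fast InnoDB Checksum
InnoDB can checksum every page read from disk as an extra safeguard against data corruption. In XtraDB, Percona improved the checksum algorithm for better performance.

5. Removal of excessive function calls
When MySQL reads data from a socket it issues many fcntl calls (the function that controls file descriptors), which hurts concurrent performance. Percona removed the redundant calls.

6. Less Buffer Pool mutex contention
Mutex contention on the Buffer Pool during InnoDB kernel operations is reduced (by splitting the mutex).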

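III. Flexibility improvements
1. Support for multiple page sizes
Although InnoDB supports several page sizes, the default of 16K cannot be changed without recompiling. XtraDB provides a system variable (innodb_page_size) to change it. Smaller pages can improve performance for most OLTP workloads, while larger pages usually give better OLAP performance.

2. Suppressing replication warnings
Under the default statement-based replication, some statements may be unsafe, for example NOW(), RAND(), calls to stored procedures/functions, or an UPDATE that uses LIMIT without ORDER BY. In those cases MySQL issues warning 1592 (the statement is unsafe to log in statement format). Unfortunately a bug in MySQL 5.1 makes the server issue this warning in some safe situations as well. Although it causes no replication problems, it fills the Error Log with unnecessary warnings. This improvement suppresses those warnings.

3. Handling line terminators in BLOBs
Percona (from 5.1.x-12.x; not supported in 5.1.x-11.x) adds a new option to the MySQL client (no-remove-eol-carret) for BLOB columns that contain \r characters.

4. Replication stop recovery
When sql_slave_skip_counter is used and a statement in the middle of an event group fails, the slave skips all remaining events until the end of that event group. This is hard to put into words; the usage example from Percona makes it clear.
http://www.percona.com/docs/wiki/percona-server:features:replication_skip_single_statement

5. Fixed read-ahead area
In InnoDB the size of the read-ahead area is computed dynamically, yet it usually ends up being the same value. XtraDB (from 5.1.x-12.x; not supported in 5.1.x-11.x) lets the size of this area be fixed, avoiding the pointless computation.
This is the patch released by Facebook: http://bazaar.launchpad.net/~mysqlatfacebook/mysqlatfacebook/5.1/revision/3538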

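IV. Reliability improvements
1. Log synchronization after a crash
In InnoDB, the slave's replication state is stored in two files that are not kept in sync (relay.index and relay.info). If the slave stops in an error state, the files end up out of sync and the last transaction is executed again. Percona adds the replication state to the XtraDB transaction log; on restart the slave can use this information to restore consistency.
A patch from Google: http://code.google.com/p/google-mysql-tools/wiki/TransactionalReplication
A bug this defect can cause: http://bugs.mysql.com/bug.php?id=34058

2. Too Many Connections warning
Percona writes the "Too Many Connections" warning to the server's error_log instead of only reporting the error to the client.

3. Error code compatibility
Percona (from 5.1.x-12.x; not supported in 5.1.x-11.x) provides compatibility with MySQL 5.5 error codes, avoiding problems with differing error codes after an upgrade to 5.5.

4. Handling corrupted tables (InnoDB)
In MySQL, once one InnoDB table is corrupted, all InnoDB tables become unavailable. XtraDB improves this: only the corrupted table is disabled and locked, and the database can keep using the other tables.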

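V. Manageability improvements
1. Fast InnoDB Recovery
A long-standing annoyance with InnoDB is that recovering InnoDB tables after a crash is very slow. Because Percona/XtraDB is based on InnoDB Plugin 1.0.8+, it has InnoDB Plugin's fast recovery as well. (Earlier Percona versions also recovered much faster than InnoDB, because early XtraDB used its own Fast Recovery.)
Some tests: http://www.mysqlperformanceblog.com/2009/07/07/improving-innodb-recovery-time/

2. InnoDB data dictionary size limit
InnoDB allocates memory for table definitions in its own table cache; this is called the data dictionary. By default, once a table is opened, the internal object representing it stays in memory until the table is dropped or the server restarts. With very many tables (say 100,000 or more; Dubbo has this situation with its logstat database) this can consume a huge amount of memory, sometimes gigabytes. Percona changed this policy: the parameter innodb_dict_size_limit limits the size of the data dictionary, making InnoDB evict entries with an LRU algorithm instead of keeping everything in memory, so that too many tables no longer exhaust memory.

3. Expanded table import
Unlike MyISAM, InnoDB does not let you copy a single table's files between servers. When exported with Xtrabackup, a table can be imported into another XtraDB server.

4. Buffer Pool in shared memory
When the Buffer Pool is very large, warming it up after a restart requires a lot of disk reads and takes a long time. Storing the Buffer Pool in shared memory saves this very time-consuming IO. That does not help when the host itself reboots; for that case use the next feature.

5. Dump/restore of the Buffer Pool
For an InnoDB instance with a very large Buffer Pool, restarting the database is painful. Usually the InnoDB Buffer Pool has to warm up before the server can take traffic, which may take a long time. XtraDB (from 5.1.x-12.x; not supported in 5.1.x-11.x) provides commands to dump and restore the contents of the Buffer Pool, so service can resume sooner after a restart.
Usage: http://www.percona.com/docs/wiki/percona-server:features:innodb_lru_dump_restore?redirect=1

6. Fast Index Creation
Fast index creation is an InnoDB Plugin feature: as long as the primary key is not changed, altering indexes is much faster than before. In some scenarios, however, it may cause corruption. XtraDB provides a parameter (innodb_fast_index_creation) to choose whether Fast Index Creation is enabled; when it is off, the original creation method is used.

7. Fast Index Renaming
XtraDB (from 5.1.x-12.x; not supported in 5.1.x-11.x) extends the ALTER TABLE command with online index renaming, which does not rebuild the index. (This is very useful for us when fixing non-standard index names.)

8. Preventing caching in Flashcache
Flashcache improves performance by caching data on SSD. It performs best when it caches the hotter data, so XtraDB provides a comment hint to skip data that does not need to be cached.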
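
Item 2 in this list (the data dictionary size limit) relies on LRU eviction, which can be sketched briefly. This is a generic Python illustration of the policy, not Percona's implementation; the class DictionaryCache, its byte accounting, and the rule of never evicting the last remaining entry are assumptions made for the example.

```python
from collections import OrderedDict


class DictionaryCache:
    """Toy LRU cache for table definitions with a size limit in bytes."""

    def __init__(self, size_limit):
        self.size_limit = size_limit
        self.entries = OrderedDict()  # table name -> definition size
        self.used = 0

    def open_table(self, name, size):
        if name in self.entries:
            self.entries.move_to_end(name)  # mark as most recently used
            return
        self.entries[name] = size
        self.used += size
        # Evict the least recently used definitions until under the limit.
        while self.used > self.size_limit and len(self.entries) > 1:
            _, evicted_size = self.entries.popitem(last=False)
            self.used -= evicted_size
```

With a limit in place, memory use stays bounded no matter how many tables are opened over time; the cost is that a rarely used table has to be loaded again on its next access.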

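VI. Diagnostics improvements
1. Extra INFORMATION_SCHEMA tables
Percona/XtraDB provides extra INFORMATION_SCHEMA tables that expose more detailed internal information, such as the contents of the internal buffer pool or statistics.

2. Slow query log extensions
Percona provides additional statistics that can be enabled through parameters. They capture as much detail as possible about the events we care about and make slow query analysis easier.

3. InnoDB status display
XtraDB reorganized the InnoDB Status output for better readability, increased the number of status values from 24 to 48, and prints the memory used by the internal hash tables. New parameters control what is output.

4. Counting InnoDB deadlocks
Any transactional application runs into deadlocks to some degree; as long as they are infrequent this is not a big problem. In InnoDB the Show InnoDB Status command only gives information on the last deadlock; it cannot tell us the total number of deadlocks or the number per unit of time. XtraDB adds a status variable that keeps the deadlock count, which gives a better picture of the deadlocks occurring in the database.

5. Logging all server-side commands (syslog)
Percona can record every command run on the server in syslog.

6. Response time distribution
Percona provides a report of the number of queries executed on the server within given intervals. This information can be used to monitor whether database performance is stable.

7. Show Storage Engines
Percona changed the output of Show Storage Engines to indicate whether XtraDB is enabled. (Previously XtraDB was also listed under the InnoDB name.)

8. Query Cache mutex status
The Query Cache can cause problems that are hard to detect. Percona modified the show processlist command so that it can show the state "Waiting on query cache mutex".

9. Showing lock names
The "show mutex status" command shows the names of the locks currently held and their os_wait values.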
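
The response time distribution from item 6 is essentially a histogram of query durations. A minimal sketch in Python follows; the class ResponseTimeHistogram and the bucket boundaries are made up for illustration and do not reflect how Percona chooses or reports its intervals.

```python
import bisect


class ResponseTimeHistogram:
    """Count queries per response-time bucket (upper bounds in seconds)."""

    def __init__(self, bounds=(0.001, 0.01, 0.1, 1.0, 10.0)):
        self.bounds = list(bounds)
        # One counter per bound, plus a final one for anything slower.
        self.counts = [0] * (len(self.bounds) + 1)

    def record(self, seconds):
        # Index of the first bound that is >= the measured time.
        self.counts[bisect.bisect_left(self.bounds, seconds)] += 1

    def report(self):
        labels = [f"<= {b}s" for b in self.bounds] + [f"> {self.bounds[-1]}s"]
        return list(zip(labels, self.counts))
```

Comparing such reports over time shows whether the share of slow queries is growing, which is the stability check the item describes.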