一、问题描述
我们经常会碰到这样的情况,某个事务执行完了未提交,后续再来一个DDL和DML操作,导致后面的session要么处于waiting for metadata lock,要么是锁等待超时。这时我们往往只能找到这个未提交的事务的事务id和session id,但是一般都处于sleep状态,不好分析事务内容到底是什么,所以通常都是粗鲁地kill这个session后解决问题,但是应用层的研发人员往往找不到到底是哪个事务引起的,后面再出现问题时还要重复kill。那这个情况下,怎么办呢?
下面我先模拟两种情况
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
mysql> create table test_lock(id int,name varchar(10)); Query OK, 0 rows affected (0.01 sec) mysql> insert into test_lock values(1,'jack'); Query OK, 1 rowaffected (0.00 sec) mysql> insert into test_lock values(2,'Jerry'); Query OK, 1 rowaffected (0.00 sec) mysql> insert into test_lock values(3,'mark'); Query OK, 1 rowaffected (0.00 sec) mysql> begin; Query OK, 0 rowsaffected (0.00 sec) mysql> update test_lock set id=123 where id=1; Query OK, 1 rowaffected (0.00 sec) Rows matched:1 Changed: 1 Warnings: 0 mysql> insert into test_lock values(4,'andy'); Query OK, 1 row affected (0.00 sec) |
这里我特意在开启事务后执行一条update,又执行了一条insert语句。
1 2 3 4 5 6 7 8 9 10 11 12 |
mysql> show variables like '%innodb_lock_w%'; +--------------------------+-------+ |Variable_name | Value | +--------------------------+-------+ |innodb_lock_wait_timeout | 5 | +--------------------------+-------+ 1 row in set(0.00 sec) mysql> update test_lock set id=1234 where id=1; ERROR 1205(HY000): Lock wait timeout exceeded; try restarting transaction mysql> alter table test_lock add column name2 varchar(50); |
这时session2一直卡住,我们再开一个窗口session3。
1 2 3 4 5 6 7 8 9 10 |
mysql> show processlist; +------+-------------+-----------+--------+---------+-------+----------------------------------+----------------------------------------------------+ | Id | User | Host | db | Command | Time | State | Info | +------+-------------+-----------+--------+---------+-------+----------------------------------+----------------------------------------------------+ | 1 | system user | | NULL | Connect | 67678 | Waiting for master to send event | NULL | | 2063 | root | localhost | sbtest | Sleep | 98 | | NULL | | 1935 | root | localhost | sbtest | Query | 8 | Waiting for table metadata lock | alter table test_lock add column name2 varchar(50) | | 1938 | root | localhost | sbtest | Query | 0 | starting | show processlist | +------+-------------+-----------+--------+---------+-------+----------------------------------+----------------------------------------------------+ 4 rows in set (0.00 sec) |
可看到ddl操作也被卡住了,之前的事务1也处于sleep状态,无法得知它到底执行了什么。
这时我们查询innodb_trx表可看到事务1也看不到sql信息。
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
mysql> select * from information_schema.innodb_trx\G *************************** 1. row *************************** trx_id: 1854303 trx_state: RUNNING trx_started: 2017-03-29 13:28:06 trx_requested_lock_id: NULL trx_wait_started: NULL trx_weight: 3 trx_mysql_thread_id: 1928 trx_query: NULL trx_operation_state: NULL trx_tables_in_use: 0 trx_tables_locked: 1 trx_lock_structs: 2 trx_lock_memory_bytes: 1136 trx_rows_locked: 1 trx_rows_modified: 1 trx_concurrency_tickets: 0 trx_isolation_level: READ COMMITTED trx_unique_checks: 1 trx_foreign_key_checks: 1 trx_last_foreign_key_error: NULL trx_adaptive_hash_latched: 0 trx_adaptive_hash_timeout: 0 trx_is_read_only: 0 trx_autocommit_non_locking: 0 1 row in set (0.00 sec) |
二、解决方案
方案一
我想到的第一种方法是利用performance_schema中的相关信息查询
1 2 3 4 5 6 7 |
mysql> show variables like 'performance_schema'; +--------------------+-------+ |Variable_name | Value | +--------------------+-------+ |performance_schema | ON | +--------------------+-------+ 1 row in set(0.00 sec) |
通过查看events_statements_current表可看到每一个session正在执行的sql,哪怕它依旧执行完成了,只是没有提交。这里可看到事务1最后执行的正是updatetest_lock set id=123 where id=1;
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
mysql> select * from performance_schema.events_statements_current\G *************************** 1. row *************************** THREAD_ID: 1953 EVENT_ID: 47 END_EVENT_ID: 47 EVENT_NAME: statement/sql/insert SOURCE: socket_connection.cc:95 TIMER_START: 69507697694362000 TIMER_END: 69507697987390000 TIMER_WAIT: 293028000 LOCK_TIME: 102000000 SQL_TEXT: insert into test_lock values(4,'andy') DIGEST: b8eea4576e85ce7af54a5a643278b6cb DIGEST_TEXT: INSERT INTO `test_lock` VALUES (...) CURRENT_SCHEMA: sbtest |
不过方案1有个缺陷,一个事务可能有一组sql组成,这个方法只能看到这个事务最后执行的是什么SQL,无法看到全部。也就是说,关于information_schema.processlist和events_statements_current如何一一对应起来,可以通过performance_schema.threads表来关联,语句如下:
1 |
mysql> select * from performance_schema.events_statements_current where THREAD_ID in (select THREAD_ID from performance_schema.threads where PROCESSLIST_ID=2063)\G |
如果你是MySQL 5.7版本,可以通过查看sys.session视图和sys.processlist视图得到最后一次执行的SQL语句。
方案二
然后我想到了是不是可以用general_log的方式,一般情况下general_log不大可能打开,所以我们先打开general_log等着问题复现的方式来定位,经测试,即使事务没有提交,一样会写到general_log。
1 2 3 4 5 6 7 8 9 10 11 |
mysql> show variables like '%general%'; +------------------+-------------------------------------------+ | Variable_name | Value | +------------------+-------------------------------------------+ | general_log | OFF | | general_log_file | /data/mysql/3306/data/qs-ops-db-01.log | +------------------+-------------------------------------------+ 2 rows in set (0.00 sec) mysql> set global general_log=1; Query OK, 0 rowsaffected (0.00 sec) |
开启general日志后,只要知道了未提交事务的进程号就可以完美找到对应的SQL语句了。
1 2 3 4 5 6 7 |
$ cat /data/mysql/3306/data/qs-ops-db-01.log | grep 2063 mysqld, Version: 5.7.17-log (MySQL Community Server (GPL)). started with: Tcp port: 3306 Unix socket: /data/mysql/3306/mysql.sock Time Id Command Argument 2017-03-29T07:22:00.949233Z 2063 Query begin 2017-03-29T07:22:11.090712Z 2063 Query update test_lock set id=123 where id=1 2017-03-29T07:22:18.347311Z 2063 Query insert into test_lock values(4,'andy') |
这样只要后续能否复现的话,就能找到所有的SQL了,就是如果此会话是长连接,那么必然执行的SQL语句较多,这时候就需要慢慢排查了。
方案三
假如后面应用层最终commit了,那么会在binlog里记录,可以根据当时的session id去binlog里面查看完整事务。
想不到还有什么更好的办法了,目前只能这样了。