Analysis of MySQL connections that cannot be killed

This problem dates back to 2017; I am sharing it here on the blog.

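Problem description

A slow business SQL query meant we urgently needed to kill the affected connections and then add an index to fix the problem. However, after the KILL commands were issued in bulk, the connections inside the database were not released. The processlist looked like this: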

| 2211362536 | ashe_ashe | **********  | ******** | Killed           |     1429 | Sending data                                                          | select cancel_package_id, status from `cancelRecord` where creditor = '527432228075566747' and debtor = '520617185996749208' and create_time between '2017-12-03' and '2017-12-03 23:59:59' and cancel_package_id <> '' group by cancel_package_id order by create_time desc

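This situation should actually be familiar. In semi-synchronous replication, for example, a transaction that is waiting for an ACK cannot be killed either.

What is KILL?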

mysql> ? kill
Name: 'KILL'
Description:
Syntax:
KILL [CONNECTION | QUERY] processlist_id

Each connection to mysqld runs in a separate thread. You can kill a
thread with the KILL processlist_id statement.

Thread processlist identifiers can be determined from the ID column of
the INFORMATION_SCHEMA.PROCESSLIST table, the Id column of SHOW
PROCESSLIST output, and the PROCESSLIST_ID column of the Performance
Schema threads table. The value for the current thread is returned by
the CONNECTION_ID() function.

KILL permits an optional CONNECTION or QUERY modifier:

o KILL CONNECTION is the same as KILL with no modifier: It terminates
  the connection associated with the given processlist_id, after
  terminating any statement the connection is executing.

o KILL QUERY terminates the statement the connection is currently
  executing, but leaves the connection itself intact.

If you have the PROCESS privilege, you can see all threads. If you have
the SUPER privilege, you can kill all threads and statements.
Otherwise, you can see and kill only your own threads and statements.

You can also use the mysqladmin processlist and mysqladmin kill
commands to examine and kill threads.

How the MySQL KILL operation works

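Put simply, when you connect to the database and run a KILL command, the server sets the kill flag of the target thread to either "kill query" or "kill connection". The target thread checks at certain points during execution whether it has been killed. If it has, it ends its own execution in an appropriate way.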
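The flag-and-self-check idea above can be sketched roughly as follows. This is a simplified model, not MySQL's actual THD code: the names `Session`, `KillState`, `request_kill`, and `run_query` are hypothetical, and the kill is injected at a chosen row only to keep the example deterministic.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical, simplified model of a server session; not MySQL's THD.
enum class KillState { NotKilled, KillQuery, KillConnection };

struct Session {
    std::atomic<KillState> killed{KillState::NotKilled};
    bool connection_open = true;
};

// What the KILL issuer does: it only sets a flag on the target session.
inline void request_kill(Session& target, KillState how) {
    target.killed.store(how);
}

// What the target does: it processes rows and checks its own flag between
// rows. Returns the number of rows processed before it stopped.
// kill_arrives_at_row simulates another connection issuing KILL at that point.
inline int run_query(Session& self, int total_rows,
                     int kill_arrives_at_row, KillState how) {
    int processed = 0;
    for (int row = 0; row < total_rows; ++row) {
        if (row == kill_arrives_at_row) {
            request_kill(self, how);
        }
        const KillState state = self.killed.load();
        if (state != KillState::NotKilled) {
            // KILL CONNECTION additionally tears down the client connection.
            if (state == KillState::KillConnection) {
                self.connection_open = false;
            }
            break;
        }
        ++processed;
    }
    return processed;
}
```

The point of the sketch is that the kill only takes effect at the next check. A thread that never reaches a check never notices the flag.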

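The difference between KILL QUERY and KILL CONNECTION is that the latter immediately closes the TCP connection between the database and the client. The client's query then receives an error right away, and if the client has a retry mechanism, it opens a new connection and runs the same SQL again. At that point, even if the client uses a connection pool to cap the number of connections, the more you kill, the more threads pile up in the database.

Why can't the thread be killed?

Here is the information recorded when the problem occurred: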

[dba@111111 ~]$ ll -h  /data/backup/ashe_psta.log 
-rw-r--r-- 1 root root 8.0M Dec  3 15:45 /data/backup/ashe_psta.log
[dba@111111~]$ ll -h /tmp/ashe
ashe_killed_not_free.log  ashe_kill.log             ashe.log                  
[dba@111111 ~]$ ll -h /tmp/ashe
ashe_killed_not_free.log  ashe_kill.log             ashe.log                  
[dba@111111 ~]$ ll -h /tmp/ashe*
-rw-r--r-- 1 dba ACTIONTECH 5.3M Dec  3 17:37 /tmp/ashe_killed_not_free.log
-rw-r--r-- 1 dba ACTIONTECH 1.2M Dec  3 15:44 /tmp/ashe_kill.log
-rw-r--r-- 1 dba ACTIONTECH 300K Dec  3 15:28 /tmp/ashe.log
[dba@100-107-22-9 ~]$ 

Processlist output while the kill was not taking effect

| 2211364964 | ashe_ashe | *******   | ******** | Killed           |     1223 | Sending data                                                          | select cancel_package_id, status from `cancelRecord` where creditor = '559175404039296431' and debtor = '628277423689283626' and create_time between '2017-12-03' and '2017-12-03 23:59:59' and cancel_package_id <> '' group by cancel_package_id order by create_time desc

Database stack trace

Thread 5013 (Thread 0x7f07ea061700 (LWP 56970)):
#0  0x0000003c1a4e15e3 in select () from /lib64/libc.so.6
#1  0x00000000009711ef in os_thread_sleep(unsigned long) ()
#2  0x00000000009cf55d in srv_conc_enter_innodb(trx_t*) ()
#3  0x0000000000923ed2 in ha_innobase::index_read(unsigned char*, unsigned char const*, unsigned int, ha_rkey_function) ()
#4  0x000000000058b94d in handler::ha_index_read_map(unsigned char*, unsigned char const*, unsigned long, ha_rkey_function) ()
#5  0x00000000006ce05c in join_read_always_key(st_join_table*) ()
#6  0x00000000006d0a1f in sub_select(JOIN*, st_join_table*, bool) ()
#7  0x00000000006ce8b1 in JOIN::exec() ()
#8  0x00000000007152b9 in mysql_execute_select(THD*, st_select_lex*, bool) ()
#9  0x0000000000715d7c in mysql_select(THD*, TABLE_LIST*, unsigned int, List<Item>&, Item*, SQL_I_List<st_order>*, SQL_I_List<st_order>*, Item*, unsigned long long, select_result*, st_select_lex_unit*, st_select_lex*) ()
#10 0x0000000000715f85 in handle_select(THD*, select_result*, unsigned long) ()
#11 0x00000000006f0ac9 in execute_sqlcom_select(THD*, TABLE_LIST*) ()
#12 0x00000000006f51c4 in mysql_execute_command(THD*) ()
#13 0x00000000006f8c48 in mysql_parse(THD*, char*, unsigned int, Parser_state*) ()
#14 0x00000000006f9f9b in dispatch_command(enum_server_command, THD*, char*, unsigned int) ()
#15 0x00000000006fbd97 in do_command(THD*) ()
#16 0x00000000006c39b6 in do_handle_one_connection(THD*) ()
#17 0x00000000006c3a95 in handle_one_connection ()
#18 0x0000000000ad9ae6 in pfs_spawn_thread ()
#19 0x0000003c1a8079d1 in start_thread () from /lib64/libpthread.so.0
#20 0x0000003c1a4e8b6d in clone () from /lib64/libc.so.6

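I reproduced this in an offline environment: when a thread is in the state shown above, it indeed cannot be killed.
The relevant code is as follows: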

/*****************************************************************//**
The thread sleeps at least the time given in microseconds. */
void
os_thread_sleep(
/*============*/
	ulint	tm)	/*!< in: time in microseconds */
{
#ifdef _WIN32
	Sleep((DWORD) tm / 1000);
#elif defined(HAVE_NANOSLEEP)
	struct timespec	t;

	t.tv_sec = tm / 1000000;
	t.tv_nsec = (tm % 1000000) * 1000;

	::nanosleep(&t, NULL);
#else
	struct timeval  t;

	t.tv_sec = tm / 1000000;
	t.tv_usec = tm % 1000000;

	select(0, NULL, NULL, NULL, &t);
#endif /* _WIN32 */
}

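At the time, the production connections were maxed out. The load was so heavy that threads like the one shown above could not obtain permission to enter InnoDB and ended up waiting indefinitely.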
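To illustrate why such a thread ignores KILL, here is a rough model of a sleep-and-retry admission loop. It is a sketch under assumptions, not the InnoDB source: `try_enter`, `EnterResult`, and `Outcome` are hypothetical names, the sleep is replaced by a counter, and `max_rounds` exists only so the simulation terminates. The assumption that the loop does not look at the kill flag is inferred from the behaviour observed above.

```cpp
#include <cassert>

// Hypothetical model of waiting for an InnoDB concurrency slot.
enum class EnterResult { Entered, Interrupted, StillWaiting };

struct Outcome {
    EnterResult result;
    int sleeps;  // how many times the thread went back to sleep
};

// active:          threads currently inside the engine
// limit:           maximum threads allowed inside at once
// killed:          whether KILL has already set the thread's flag
// check_kill_flag: whether the loop looks at that flag at all
// max_rounds:      bound for the simulation only
inline Outcome try_enter(int active, int limit, bool killed,
                         bool check_kill_flag, int max_rounds) {
    int sleeps = 0;
    for (int round = 0; round < max_rounds; ++round) {
        if (active < limit) {
            return {EnterResult::Entered, sleeps};
        }
        if (check_kill_flag && killed) {
            return {EnterResult::Interrupted, sleeps};
        }
        ++sleeps;  // stands in for a call such as os_thread_sleep()
    }
    return {EnterResult::StillWaiting, sleeps};
}
```

With `check_kill_flag` set to false and the engine permanently full, the thread keeps sleeping no matter what the flag says, which matches the "Killed" rows that never went away.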

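This raises a question: lock waits are usually resolved with KILL, so why does KILL succeed in that case?
Take the following example, where a DDL operation is blocked by a slow query.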

Thread 55 (Thread 0x7f54d2d81700 (LWP 23454)):
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
#1  0x0000000001998675 in native_cond_timedwait (cond=0x7f54b0000d28, mutex=0x7f54b0000ce0, abstime=0x7f54d2d7cf40) at /data/mysql-5.7.20/include/thr_cond.h:129
#2  0x00000000019989db in safe_cond_timedwait (cond=0x7f54b0000d28, mp=0x7f54b0000cb8, abstime=0x7f54d2d7cf40, file=0x20c6ae9 "/data/mysql-5.7.20/sql/mdl.cc", line=1861) at /data/mysql-5.7.20/mysys/thr_cond.c:88
#3  0x0000000001522616 in my_cond_timedwait (cond=0x7f54b0000d28, mp=0x7f54b0000cb8, abstime=0x7f54d2d7cf40, file=0x20c6ae9 "/data/mysql-5.7.20/sql/mdl.cc", line=1861) at /data/mysql-5.7.20/include/thr_cond.h:180
#4  0x0000000001522bc2 in inline_mysql_cond_timedwait (that=0x7f54b0000d28, mutex=0x7f54b0000cb8, abstime=0x7f54d2d7cf40, src_file=0x20c6ae9 "/data/mysql-5.7.20/sql/mdl.cc", src_line=1861) at /data/mysql-5.7.20/include/mysql/psi/mysql_thread.h:1229
#5  0x0000000001523df1 in MDL_wait::timed_wait (this=0x7f54b0000cb8, owner=0x7f54b0000c20, abs_timeout=0x7f54d2d7cf40, set_status_on_timeout=false, wait_state_name=0x2dd79c8 <MDL_key::m_namespace_to_wait_state_name+72>) at /data/mysql-5.7.20/sql/mdl.cc:1861
#6  0x0000000001525cfe in MDL_context::acquire_lock (this=0x7f54b0000cb8, mdl_request=0x7f54d2d7d010, lock_wait_timeout=31536000) at /data/mysql-5.7.20/sql/mdl.cc:3629
#7  0x0000000001526520 in MDL_context::upgrade_shared_lock (this=0x7f54b0000cb8, mdl_ticket=0x7f54b000dad0, new_type=MDL_EXCLUSIVE, lock_wait_timeout=31536000) at /data/mysql-5.7.20/sql/mdl.cc:3893
#8  0x00000000016ca804 in mysql_inplace_alter_table (thd=0x7f54b0000c20, table_list=0x7f54b00067c8, table=0x7f54b0015070, altered_table=0x7f54b00174f0, ha_alter_info=0x7f54d2d7d3e0, inplace_supported=HA_ALTER_INPLACE_NO_LOCK_AFTER_PREPARE, target_mdl_request=0x7f54d2d7d600, alter_ctx=0x7f54d2d7dd50) at /data/mysql-5.7.20/sql/sql_table.cc:7380
#9  0x00000000016d0230 in mysql_alter_table (thd=0x7f54b0000c20, new_db=0x7f54b0006d50 "ashe", new_name=0x0, create_info=0x7f54d2d7ed80, table_list=0x7f54b00067c8, alter_info=0x7f54d2d7ecd0) at /data/mysql-5.7.20/sql/sql_table.cc:9711
#10 0x000000000185a017 in Sql_cmd_alter_table::execute (this=0x7f54b0006e40, thd=0x7f54b0000c20) at /data/mysql-5.7.20/sql/sql_alter.cc:316
#11 0x0000000001632e98 in mysql_execute_command (thd=0x7f54b0000c20, first_level=true) at /data/mysql-5.7.20/sql/sql_parse.cc:4857
#12 0x0000000001634ecd in mysql_parse (thd=0x7f54b0000c20, parser_state=0x7f54d2d80550) at /data/mysql-5.7.20/sql/sql_parse.cc:5577
#13 0x0000000001629c73 in dispatch_command (thd=0x7f54b0000c20, com_data=0x7f54d2d80e00, command=COM_QUERY) at /data/mysql-5.7.20/sql/sql_parse.cc:1461
#14 0x0000000001628ac0 in do_command (thd=0x7f54b0000c20) at /data/mysql-5.7.20/sql/sql_parse.cc:999
#15 0x000000000176b481 in handle_connection (arg=0x516f010) at /data/mysql-5.7.20/sql/conn_handler/connection_handler_per_thread.cc:300
#16 0x0000000001e5afb9 in pfs_spawn_thread (arg=0x50c70d0) at /data/mysql-5.7.20/storage/perfschema/pfs.cc:2190
#17 0x00007f54f32d86ba in start_thread (arg=0x7f54d2d81700) at pthread_create.c:333
#18 0x00007f54f276d3dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
Thread 54 (Thread 0x7f54d2dc2700 (LWP 22177)):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
#1  0x000000000199869a in native_cond_wait (cond=0x2eb1920 <Per_thread_connection_handler::COND_thread_cache>, mutex=0x2eb18c8 <Per_thread_connection_handler::LOCK_thread_cache+40>) at /data/mysql-5.7.20/include/thr_cond.h:140
#2  0x0000000001998800 in safe_cond_wait (cond=0x2eb1920 <Per_thread_connection_handler::COND_thread_cache>, mp=0x2eb18a0 <Per_thread_connection_handler::LOCK_thread_cache>, file=0x2295c28 "/data/mysql-5.7.20/sql/conn_handler/connection_handler_per_thread.cc", line=145) at /data/mysql-5.7.20/mysys/thr_cond.c:48
#3  0x000000000176a7e8 in my_cond_wait (cond=0x2eb1920 <Per_thread_connection_handler::COND_thread_cache>, mp=0x2eb18a0 <Per_thread_connection_handler::LOCK_thread_cache>, file=0x2295c28 "/data/mysql-5.7.20/sql/conn_handler/connection_handler_per_thread.cc", line=145) at /data/mysql-5.7.20/include/thr_cond.h:193
#4  0x000000000176aaed in inline_mysql_cond_wait (that=0x2eb1920 <Per_thread_connection_handler::COND_thread_cache>, mutex=0x2eb18a0 <Per_thread_connection_handler::LOCK_thread_cache>, src_file=0x2295c28 "/data/mysql-5.7.20/sql/conn_handler/connection_handler_per_thread.cc", src_line=145) at /data/mysql-5.7.20/include/mysql/psi/mysql_thread.h:1184
#5  0x000000000176af85 in Per_thread_connection_handler::block_until_new_connection () at /data/mysql-5.7.20/sql/conn_handler/connection_handler_per_thread.cc:145
#6  0x000000000176b538 in handle_connection (arg=0x516f010) at /data/mysql-5.7.20/sql/conn_handler/connection_handler_per_thread.cc:329
#7  0x0000000001e5afb9 in pfs_spawn_thread (arg=0x50c70d0) at /data/mysql-5.7.20/storage/perfschema/pfs.cc:2190
#8  0x00007f54f32d86ba in start_thread (arg=0x7f54d2dc2700) at pthread_create.c:333
#9  0x00007f54f276d3dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

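During a routine release, a DBA who runs into this will usually kill the DDL first, track down the slow SQL, and then decide on the next step.

The lock wait can be traced to this function: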

MDL_wait::enum_wait_status
MDL_wait::timed_wait(MDL_context_owner *owner, struct timespec *abs_timeout,
                     bool set_status_on_timeout,
                     const PSI_stage_info *wait_state_name)
{
  PSI_stage_info old_stage;
  enum_wait_status result;
  int wait_result= 0;

  mysql_mutex_lock(&m_LOCK_wait_status);

  owner->ENTER_COND(&m_COND_wait_status, &m_LOCK_wait_status,
                    wait_state_name, & old_stage);
  thd_wait_begin(NULL, THD_WAIT_META_DATA_LOCK);
  while (!m_wait_status && !owner->is_killed() &&
         wait_result != ETIMEDOUT && wait_result != ETIME)
  {
    wait_result= mysql_cond_timedwait(&m_COND_wait_status, &m_LOCK_wait_status,
                                      abs_timeout);
  }
  thd_wait_end(NULL);

  if (m_wait_status == EMPTY)
  {
    /*
      Wait has ended not due to a status being set from another
      thread but due to this connection/statement being killed or a
      time out.
      To avoid races, which may occur if another thread sets
      GRANTED status before the code which calls this method
      processes the abort/timeout, we assign the status under
      protection of the m_LOCK_wait_status, within the critical
      section. An exception is when set_status_on_timeout is
      false, which means that the caller intends to restart the
      wait.
    */
    if (owner->is_killed())
      m_wait_status= KILLED;
    else if (set_status_on_timeout)
      m_wait_status= TIMEOUT;
  }
  result= m_wait_status;

  mysql_mutex_unlock(&m_LOCK_wait_status);
  owner->EXIT_COND(& old_stage);

  return result;
}

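KILL takes effect on the DDL immediately because, while the DDL thread waits for the lock, its internal loop keeps checking whether it has been killed.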
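The same pattern can be written with the C++ standard library. This is only a minimal sketch of "timed wait that re-checks a kill flag", not MySQL's implementation: `LockWaiter`, `WaitStatus`, `grant`, and `kill` are hypothetical names, and having `kill` notify the condition variable is an assumption made here so the waiter wakes up promptly.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <mutex>

enum class WaitStatus { Granted, Killed, Timeout };

// Hypothetical, simplified stand-in for a lock waiter; not MySQL's MDL_wait.
class LockWaiter {
public:
    // Called by the thread that releases the lock.
    void grant() {
        std::lock_guard<std::mutex> guard(mu_);
        granted_ = true;
        cv_.notify_all();
    }

    // Called by the thread that issues the kill.
    void kill() {
        std::lock_guard<std::mutex> guard(mu_);
        killed_ = true;
        cv_.notify_all();
    }

    // Waits until the lock is granted, the waiter is killed, or the
    // timeout expires. The kill flag is re-checked on every wakeup.
    WaitStatus timed_wait(std::chrono::milliseconds timeout) {
        const auto deadline = std::chrono::steady_clock::now() + timeout;
        std::unique_lock<std::mutex> lock(mu_);
        bool timed_out = false;
        while (!granted_ && !killed_ && !timed_out) {
            timed_out =
                cv_.wait_until(lock, deadline) == std::cv_status::timeout;
        }
        if (granted_) return WaitStatus::Granted;
        if (killed_) return WaitStatus::Killed;
        return WaitStatus::Timeout;
    }

private:
    std::mutex mu_;
    std::condition_variable cv_;
    bool granted_ = false;
    bool killed_ = false;
};
```

Because the loop condition includes the kill flag, a kill ends the wait as soon as the waiter wakes up, instead of leaving it asleep until the lock becomes available.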

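Improving how KILL is used

Use KILL QUERY instead of a plain KILL. A plain KILL defaults to KILL CONNECTION, which makes the client notice the failed query immediately, so it may reconnect and send the same slow query again. If the database has not yet released the killed threads, the result is that the more you kill, the more threads there are.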
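A toy calculation makes the pile-up concrete. It rests on loud assumptions: the killed threads are never released during the incident, every client retries exactly once per kill, and nothing caps new connections. The function `server_threads_after` and the `KillMode` enum are hypothetical.

```cpp
#include <cassert>

enum class KillMode { Query, Connection };

// Hypothetical model: returns the number of server threads after a number
// of bulk-kill rounds, assuming killed threads stay stuck the whole time.
inline int server_threads_after(int stuck_threads, int kill_rounds,
                                KillMode mode, bool client_retries) {
    int threads = stuck_threads;
    const int waiting_clients = stuck_threads;  // clients with a stuck query
    for (int round = 0; round < kill_rounds; ++round) {
        if (mode == KillMode::Connection && client_retries) {
            // The socket is closed, so each client sees an error, reconnects,
            // and re-sends the same query. The old thread is still stuck.
            threads += waiting_clients;
        }
        // KILL QUERY keeps the connection, so the client is still waiting
        // on the same thread and no new thread is created.
    }
    return threads;
}
```

Under these assumptions, 100 stuck threads become 400 after three rounds of KILL CONNECTION, while KILL QUERY leaves the count at 100.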
