Details
-
Type:
Bug
-
Status: Stalled
-
Priority:
Major
-
Resolution: Unresolved
-
Affects Version/s: 10.0.17
-
Fix Version/s: None
-
Component/s: Replication
-
Labels:
-
Environment:rhel6 10.0.17 master and slave
Description
from MDEV-7126 - same symptoms - different case
reproduced on 10.0.17:
its quite likely this had an active IO thread and stopped SQL thread (was running pt-slave-delay --delay 24h --interval 1hr)
mysql -S /var/lib/mariadb/mysql-10.sock Welcome to the MariaDB monitor. Commands end with ; or \g. Your MariaDB connection id is 1438 Server version: 10.0.17-MariaDB-log MariaDB Server Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. MariaDB [(none)]> stop slave; set global slave_domain_parallel_threads=15; start slave;
]$ mysql -u openquery -S /var/lib/mariadb/mysql-10.sock Welcome to the MariaDB monitor. Commands end with ; or \g. Your MariaDB connection id is 1443 Server version: 10.0.17-MariaDB-log MariaDB Server Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. MariaDB [(none)]> show processlist; +------+-------------+-----------+--------------------+---------+--------+--------------------------------------------------------------------------------+--------------------+----------+ | Id | User | Host | db | Command | Time | State | Info | Progress | +------+-------------+-----------+--------------------+---------+--------+--------------------------------------------------------------------------------+--------------------+----------+ | 3 | system user | | NULL | Connect | 264856 | Waiting for work from SQL thread | NULL | 0.000 | | 4 | system user | | NULL | Connect | 264856 | Waiting for work from SQL thread | NULL | 0.000 | | 5 | system user | | NULL | Connect | 264856 | Waiting for work from SQL thread | NULL | 0.000 | | 6 | system user | | NULL | Connect | 264856 | Waiting for work from SQL thread | NULL | 0.000 | | 7 | system user | | NULL | Connect | 264856 | Waiting for work from SQL thread | NULL | 0.000 | | 8 | system user | | NULL | Connect | 264856 | Waiting for work from SQL thread | NULL | 0.000 | | 9 | system user | | NULL | Connect | 264856 | Waiting for work from SQL thread | NULL | 0.000 | | 10 | system user | | NULL | Connect | 264856 | Waiting for work from SQL thread | NULL | 0.000 | | 11 | system user | | NULL | Connect | 264856 | Waiting for work from SQL thread | NULL | 0.000 | | 12 | system user | | NULL | Connect | 264856 | Waiting for work from SQL thread | NULL | 0.000 | | 13 | system user | | NULL | Connect | 114397 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 14 | system user | | NULL | Connect | 114397 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 15 | system user | | NULL | Connect | 114397 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 16 | system user | | NULL | Connect | 114396 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 17 | system user | | NULL | Connect | 114398 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 18 | system user | | NULL | Connect | 114396 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 19 | system user | | NULL | Connect | 114398 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 20 | system user | | NULL | Connect | 114398 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 21 | system user | | NULL | Connect | 114396 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 22 | system user | | NULL | Connect | 114397 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 24 | system user | | NULL | Connect | 264856 | Waiting for master to send event | NULL | 0.000 | | 39 | openquery | localhost | NULL | Sleep | 907 | | NULL | 0.000 | | 1349 | system user | | NULL | Connect | 26107 | Reading event from the relay log | NULL | 0.000 | | 1438 | root | localhost | NULL | Query | 147 | Killing slave | stop slave | 0.000 | | 1440 | munin | localhost | information_schema | Query | 76 | Filling schema table | SHOW GLOBAL STATUS | 0.000 | | 1441 | root | localhost | NULL | Query | 41 | init | show slave status | 0.000 | | 1442 | munin | localhost | information_schema | Query | 14 | Filling schema table | SHOW GLOBAL STATUS | 0.000 | | 1443 | openquery | localhost | NULL | Query | 0 | init | show processlist | 0.000 | +------+-------------+-----------+--------------------+---------+--------+--------------------------------------------------------------------------------+--------------------+----------+
SQL thread was stopped - set password on non-existent user.
MariaDB [(none)]> set global sql_slave_skip_counter=1; start slave; ERROR 1198 (HY000): This operation cannot be performed as you have a running slave ''; run STOP SLAVE '' first Query OK, 0 rows affected, 1 warning (0.00 sec) MariaDB [(none)]> stop slave;set global sql_slave_skip_counter=1; start slave;
erver version: 10.0.17-MariaDB-log MariaDB Server Copyright (c) 2000, 2015, Oracle, MariaDB Corporation Ab and others. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. MariaDB [(none)]> show processlist; | Id | User | Host | db | Command | Time | State | Info | Progress | | 3 | system user | | NULL | Connect | 1010764 | Waiting for work from SQL thread | NULL | 0.000 | | 4 | system user | | NULL | Connect | 1010764 | Waiting for work from SQL thread | NULL | 0.000 | | 5 | system user | | NULL | Connect | 1010764 | Waiting for work from SQL thread | NULL | 0.000 | | 6 | system user | | NULL | Connect | 1010764 | Waiting for work from SQL thread | NULL | 0.000 | | 7 | system user | | NULL | Connect | 1010764 | Waiting for work from SQL thread | NULL | 0.000 | | 8 | system user | | NULL | Connect | 944326 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 9 | system user | | NULL | Connect | 944324 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 10 | system user | | NULL | Connect | 944327 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 11 | system user | | NULL | Connect | 944326 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 12 | system user | | NULL | Connect | 944327 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 13 | system user | | NULL | Connect | 944326 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 14 | system user | | NULL | Connect | 944325 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 15 | system user | | NULL | Connect | 944328 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 16 | system user | | NULL | Connect | 944328 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 17 | system user | | NULL | Connect | 944324 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 18 | system user | | NULL | Connect | 944328 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 19 | system user | | NULL | Connect | 944324 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 20 | system user | | NULL | Connect | 944326 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 21 | system user | | NULL | Connect | 944324 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 22 | system user | | NULL | Connect | 944324 | Waiting for prior transaction to start commit before starting next transaction | NULL | 0.000 | | 23 | system user | | NULL | Connect | 1010763 | Waiting for master to send event | NULL | 0.000 | | 28 | openquery | localhost | NULL | Killed | 127489 | init | SHOW SLAVE STATUS | 0.000 | | 291 | system user | | NULL | Connect | 854691 | Reading event from the relay log | NULL | 0.000 | | 2711 | openquery | localhost | NULL | Query | 129024 | Killing slave | stop slave | 0.000 | | 2715 | munin | localhost | information_schema | Killed | 128765 | Filling schema table | SHOW GLOBAL STATUS | 0.000 | | 2716 | root | localhost | NULL | Killed | 128709 | init | show slave status | 0.000 | | 2719 | munin | localhost | information_schema | Killed | 128703 | Filling schema table | SHOW GLOBAL STATUS | 0.000 | | 2741 | openquery | localhost | NULL | Query | 0 | init | show processlist | 0.000 | 28 rows in set (0.00 sec)
Threads where well and truly halted before i killed off the global status threads.
(gdb) where #0 0x000000348e40b98e in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x000000000056167f in terminate_slave_thread () #2 0x0000000000564d45 in terminate_slave_threads(Master_info*, int, bool) () #3 0x00000000005efabf in stop_slave(THD*, Master_info*, bool) () #4 0x00000000005d3fe2 in mysql_execute_command(THD*) () #5 0x00000000005d8552 in mysql_parse () #6 0x00000000005da7b0 in dispatch_command(enum_server_command, THD*, char*, unsigned int) () #7 0x00000000006963d3 in do_handle_one_connection(THD*) () #8 0x00000000006964a2 in handle_one_connection () #9 0x000000348e4079d1 in start_thread () from /lib64/libpthread.so.0 #10 0x000000348e0e8b6d in clone () from /lib64/libc.so.6
full bt in gdb-10.0-terminate_slave_thread.txt
Analysis
Positions are from objdump on the executable
Looking at the show slave status this is at the pthread_mutex_lock position 0x63551a in fill_status though a copy show pthread_mutex lock is from fill_status(635419)->show_status_array(62a94c)->show_heartbeat_period(54a982).
show_heartbeat_period is blocked acquiring the LOCK_active_mi.
fill_status(63551a) is waiting on LOCK_show_status just before show_status_array which means its waiting on the fill_status that reached show_heartbeat_period.
0x5d5953 in mysql_execute_command is waiting on a lock LOCK_active_mi for a show slave status command.
terminate_slave_thread looking at the disassembled code the following is the only cond_timedwait
mysql_cond_timedwait(term_cond, term_lock, &abstime)
and its waiting here.
thd->LOCK_thd_data is unlocked.
term_lock is sql_lock (in terminate_slave_threads) based on the position in the terminate_slave_threads, which is (mi->rli.run_lock) (which is locked) in stop_slave (by calling lock_slave_thread).
term_cond is mi->rli.stop_cond
We can see this is terminating the sql thread.
In both stack traces have handle_slave_sql(line 4676) calling in rpl_parallel::wait_for_done(6d52ec) at mysql_cond_wait(&e->COND_parallel_entry, &rpt->LOCK_rpl_thread); gdb-10.0-terminate_slave_thread.txt has two of these threads in the same place which strikes me as odd (wasn't multisource).
So the blocking rpl_parallel::wait_for_done at a lock point and is preventing the {{mysql_cond_broadcast(&rli->stop_cond); }}at the bottom of the handle_slave_sql thread.
So the point of being stuck is at rpl_parallel::wait_for_done(6d52ec) at mysql_cond_wait(&e->COND_parallel_entry, &rpt->LOCK_rpl_thread)
Gliffy Diagrams
Attachments
Issue Links
Activity
- All
- Comments
- Work Log
- History
- Activity
- Transitions
Slave status locks fixed in
MDEV-7126https://github.com/MariaDB/server/commit/f8320210e78232fed350754faf72d14af63d0c3e