MariaDB Server — MDEV-6825

mysqld server "deadlocks" after issuing STOP SLAVE command while replicated DELETE query is executing (freeing items vs killing slave)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.5.37, 5.5.39
    • Fix Version/s: 5.5.43

      Description

      Hello,

      I noticed this problem after replacing MySQL 5.5.x with MariaDB 5.5.
      It occurred on MariaDB 5.5.37 and 5.5.39, as well as some older MariaDB 5.5 versions.

      It is possible to render a MariaDB instance unusable by putting it into a kind of internal "deadlock state".

      How to reproduce:

      1. Use master-slave setup with statement based replication and InnoDB as storage engine.
      2. Wait until slave server replicates and executes a DELETE query replicated from master.
      3. Issue "STOP SLAVE" command on the slave server before execution of the DELETE query is completed by the replication sql_thread.
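
      The steps above can be sketched as follows (statement text is illustrative, not taken from the affected system):

```sql
-- On the slave, while the replication sql_thread is still executing
-- the replicated DELETE (it appears as a "system user" thread):
SHOW PROCESSLIST;  -- the replicated DELETE shows up here

-- Issue STOP SLAVE before that DELETE completes:
STOP SLAVE;
-- Expected: STOP SLAVE waits for the current event to finish, then stops.
-- Observed: STOP SLAVE hangs in "Killing slave" and the DELETE hangs
-- in "Freeing items" indefinitely.
```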

      Effect:
      MariaDB on the slave server will enter a "deadlock state" (of sorts):

      • STOP SLAVE stalls in State: "Killing slave"
      • the DELETE query stalls in State: "Freeing items"
      • CPU load generated by mysqld drops to nearly 0
      • I/O load generated by mysqld drops to 0

      Last entry in mysql.err:

      140909  3:05:02 [Note] Slave SQL thread exiting, replication stopped in log 'mysql-bin.034913' at position 53766638
      

      Output head from "SHOW PROCESSLIST" (taken after the "deadlock state" had been in place for a few hours; normally our DELETE queries take much less time to complete):

       
      Id      User    Host    db      Command Time    State   Info    Progress
      24611   system user             NULL    Connect 89203   Waiting for master to send event        NULL    0.000
      24612   system user             NULL    Connect 17161   freeing items   NULL    0.000
      35399   [user1-namechanged]  [ip-removed]:43814    NULL    Query   15421   Killing slave   stop slave      0.000
      35400   [user2-namechanged]       [ip-removed]:50631    NULL    Query   15420   executing       SHOW /*!50002 GLOBAL */ STATUS  0.000
      35401   [user2-namechanged]       [ip-removed]:51717    NULL    Sleep   15396           NULL    0.000
      35402   [user2-namechanged]       [ip-removed]:51718    NULL    Sleep   15371           NULL    0.000
      35403   [user3-namechanged] [ip-removed]:52953    NULL    Query   15347   executing       show global status      0.000
      35404   [user2-namechanged]       [ip-removed]:51719    NULL    Sleep   15346           NULL    0.000
      35405   [user2-namechanged]       [ip-removed]:51720    NULL    Sleep   15321           NULL    0.000
      

      (full processlist attached)

      Output head from running Poor Man's Profiler on mysqld in the "deadlock state":

          855 __lll_lock_wait,_L_lock_1172,pthread_mutex_lock,inline_mysql_mutex_lock,at,at,one_thread_per_connection_end,do_handle_one_connection,handle_one_connection,start_thread,clone,??
            2 __lll_lock_wait,_L_lock_1172,pthread_mutex_lock,inline_mysql_mutex_lock,at,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone,??
            1 read,vio_read,vio_read_buff,my_real_read,my_net_read,cli_safe_read,handle_slave_io,start_thread,clone,??
            1 pthread_cond_wait@@GLIBC_2.3.2,os_cond_wait,reset_sig_count=5),srv_purge_thread,start_thread,clone,??
            1 pthread_cond_wait@@GLIBC_2.3.2,os_cond_wait,reset_sig_count=43),srv_master_thread,start_thread,clone,??
            1 pthread_cond_timedwait@@GLIBC_2.3.2,os_cond_wait_timed,time_in_usec=<value,srv_monitor_thread,start_thread,clone,??
            1 pthread_cond_timedwait@@GLIBC_2.3.2,os_cond_wait_timed,time_in_usec=<value,srv_LRU_dump_restore_thread,start_thread,clone,??
            1 pthread_cond_timedwait@@GLIBC_2.3.2,os_cond_wait_timed,time_in_usec=<value,srv_lock_timeout_thread,start_thread,clone,??
            1 pthread_cond_timedwait@@GLIBC_2.3.2,os_cond_wait_timed,time_in_usec=<value,srv_error_monitor_thread,start_thread,clone,??
            1 pthread_cond_timedwait@@GLIBC_2.3.2,inline_mysql_cond_timedwait,term_lock=0x7f7a1cb26d48,,terminate_slave_threads,stop_slave,mysql_execute_command,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone,??
            1 pthread_cond_timedwait@@GLIBC_2.3.2,inline_mysql_cond_timedwait,sleep_time=<value,ma_checkpoint_background,start_thread,clone,??
            1 poll,handle_connections_sockets,mysqld_main,__libc_start_main,_start
            1 __lll_lock_wait,_L_lock_1172,pthread_mutex_lock,inline_mysql_mutex_lock,tables=0x7f8e36dd2338,,get_schema_tables_result,JOIN::exec,mysql_select,handle_select,execute_sqlcom_select,execute_show_status,at,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone,??
            1 __lll_lock_wait,_L_lock_1172,pthread_mutex_lock,inline_mysql_mutex_lock,tables=0x7f8e36cd5338,,get_schema_tables_result,JOIN::exec,mysql_select,handle_select,execute_sqlcom_select,execute_show_status,at,mysql_parse,dispatch_command,do_command,do_handle_one_connection,handle_one_connection,start_thread,clone,??
      

      (full output attached)
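
      For reference, Poor Man's Profiler is essentially a gdb one-liner that dumps all thread stacks and counts how many threads share each stack. A rough, self-contained sketch of the aggregation step is below; the sample stack text is made up, and against a live server you would instead feed it the output of something like `gdb -batch -p $(pidof mysqld) -ex 'thread apply all bt'` (real gdb frames also contain addresses and source locations, which the real PMP script strips out).

```shell
# Simulated "thread apply all bt" output (illustrative data only).
cat > /tmp/stacks.txt <<'EOF'
Thread 2
#0  __lll_lock_wait
#1  pthread_mutex_lock
Thread 1
#0  __lll_lock_wait
#1  pthread_mutex_lock
EOF

# Collapse each thread's frames into one comma-joined stack signature,
# then count how many threads share each signature (most common first).
awk '/^Thread/ { if (s) print s; s = "" }
     /^#/     { sub(/^#[0-9]+ +/, ""); s = (s ? s "," $1 : $1) }
     END      { if (s) print s }' /tmp/stacks.txt | sort | uniq -c | sort -rn
```

      A signature shared by hundreds of threads stuck in `pthread_mutex_lock` (as in the output above, where 855 connection threads wait in `one_thread_per_connection_end`) is the classic symptom of a mutex deadlock.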

            Activity

            elenst Elena Stepanova added a comment -

            Hi,

            Could you please also attach your cnf files from the master and slave, or the output of SHOW GLOBAL VARIABLES if you prefer?
            Please also indicate which binlog format the DELETE is written with, if you can.
            If it happens on a particular table and/or with a particular DELETE query, the table structure and an idea of how many rows are being deleted could also help to reproduce the problem.

            Thanks.

            maczal Maciej Zalewski added a comment -

            Hello,

            I've attached the my.cnf files from both master and slave.
            In this particular case the master was not upgraded and is MySQL 5.5.28; the slave is MariaDB 5.5.39.

            binlog_format is set to 'STATEMENT'

            The problem occurred on a few servers with different database schemas. These are DELETEs that affect multiple rows (up to 1000 rows per statement). In this particular case the rows in the DELETE query are specified by primary key, which consists of
            2 columns (int(10) unsigned NOT NULL, binary(12) NOT NULL), and the query has the form:

            DELETE FROM t1 WHERE (c1=3345 AND c2='1234567890ab' ) OR (c1=111111 AND c2='abcABCabc+++' ) OR ...

            The row count of this particular table varies daily between 20,000 and 500,000, with many rows being inserted and deleted.
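
            For illustration only, a table matching that description would look roughly like this (names and payload columns are made up; the real schema was not included in this report):

```sql
CREATE TABLE t1 (
  c1 INT(10) UNSIGNED NOT NULL,
  c2 BINARY(12) NOT NULL,
  -- payload columns omitted
  PRIMARY KEY (c1, c2)
) ENGINE=InnoDB;
```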

            elenst Elena Stepanova added a comment -

            I've been trying to reproduce the problem and haven't succeeded so far, but here are two observations:

            1) with 500,000 rows and a 1000-value DELETE statement, even on my modest local machine it's nearly impossible to catch the DELETE statement in progress; the DELETE finishes almost momentarily, generally in less than a second. I had to raise the numbers to 1,000,000 rows and 10,000 values respectively before I could reliably issue STOP SLAVE while the DELETE was running;

            2) even with these increased numbers, I cannot actually see the state 'Freeing items', neither when the event is executed normally nor when STOP SLAVE is waiting for it. I only see 'updating'; apparently the 'freeing items' phase is too fast to observe.

            So, the questions:

            How long does the DELETE take normally, when you don't run STOP SLAVE?

            When you reproduce the problem and run STOP SLAVE, do you do it when the DELETE is already in the 'Freeing items' state, or does it reach that state later?

            Thanks.

            maczal Maciej Zalewski added a comment - - edited

            Unfortunately, it seems I am no longer able to reproduce the issue, for some reason.

            As for your questions:

            1) The DELETE takes about 10-20 seconds to complete normally. This is one of the smaller tables in the database, and it is not used very frequently (a series of INSERTs once an hour, then a SELECT and a series of DELETEs, also once per hour). There are other tables, much bigger in size, that are accessed many times per second, so I guess this table's data and index pages get rotated out of the InnoDB buffer pool pretty quickly.

            The table size:

            mysql> SELECT DATA_LENGTH, TABLE_ROWS, INDEX_LENGTH FROM information_schema.TABLES where TABLE_NAME='[tablename-removed]'\G
            
             DATA_LENGTH: 138477568
              TABLE_ROWS: 44524
            INDEX_LENGTH: 13647872
            1 row in set (0.00 sec)
            

            The database size:

            mysql> SELECT COUNT(*), SUM(DATA_LENGTH), SUM(TABLE_ROWS), SUM(INDEX_LENGTH) FROM information_schema.TABLES\G
            
                     COUNT(*): 268
             SUM(DATA_LENGTH): 173926005611
              SUM(TABLE_ROWS): 988327726
            SUM(INDEX_LENGTH): 122425886720
            1 row in set (3.72 sec)
            
            mysql> show global status like 'Uptime%';
            +---------------------------+---------+
            | Variable_name             | Value   |
            +---------------------------+---------+
            | Uptime                    | 2048546 |
            | Uptime_since_flush_status | 2048546 |
            +---------------------------+---------+
            
            mysql> show global status like 'Queries%';
            +---------------+-----------+
            | Variable_name | Value     |
            +---------------+-----------+
            | Queries       | 353304375 |
            +---------------+-----------+
            

            2) The STOP SLAVE was originally issued by a crontab script. I am no longer able to reproduce it myself for some reason.

            elenst Elena Stepanova added a comment -

            Given that here we also had concurrent execution of STOP SLAVE and SHOW GLOBAL STATUS, I think it's reasonable to believe this is the same issue as MDEV-7126, which was fixed in 5.5.43.


              People

              • Assignee: serg Sergei Golubchik
              • Reporter: maczal Maciej Zalewski
              • Votes: 0
              • Watchers: 3
