MariaDB Server / MDEV-8311

Replication slave is stuck without any error

    Details

      Description

      When I execute SHOW SLAVE STATUS, everything seems OK, but the slave's SQL thread is actually stuck! The Exec_Master_Log_Pos (mysql-bin.000160, position 2140880879) has not changed for several days.

      MariaDB [(none)]> show slave status\G;
      *************************** 1. row ***************************
                     Slave_IO_State: Waiting for master to send event
                        Master_Host: 192.168.1.113
                        Master_User: repl
                        Master_Port: 5518
                      Connect_Retry: 60
                    Master_Log_File: mysql-bin.000282
                Read_Master_Log_Pos: 531076432
                     Relay_Log_File: relay-bin.000002
                      Relay_Log_Pos: 691
              Relay_Master_Log_File: mysql-bin.000160
                   Slave_IO_Running: Yes
                  Slave_SQL_Running: Yes
                    Replicate_Do_DB: 
                Replicate_Ignore_DB: 
                 Replicate_Do_Table: 
             Replicate_Ignore_Table: 
            Replicate_Wild_Do_Table: 
        Replicate_Wild_Ignore_Table: 
                         Last_Errno: 0
                         Last_Error: 
                       Skip_Counter: 0
                Exec_Master_Log_Pos: 2140880879
                    Relay_Log_Space: 124435203348
                    Until_Condition: None
                     Until_Log_File: 
                      Until_Log_Pos: 0
                 Master_SSL_Allowed: No
                 Master_SSL_CA_File: 
                 Master_SSL_CA_Path: 
                    Master_SSL_Cert: 
                  Master_SSL_Cipher: 
                     Master_SSL_Key: 
              Seconds_Behind_Master: 283562
      Master_SSL_Verify_Server_Cert: No
                      Last_IO_Errno: 0
                      Last_IO_Error: 
                     Last_SQL_Errno: 0
                     Last_SQL_Error: 
        Replicate_Ignore_Server_Ids: 
                   Master_Server_Id: 1895939662
                     Master_SSL_Crl: 
                 Master_SSL_Crlpath: 
                         Using_Gtid: Current_Pos
                        Gtid_IO_Pos: 0-1895939662-13106380
      1 row in set (0.00 sec)
      

      I use parallel replication with SLAVE_PARALLEL_THREADS=16. One worker thread is always in the state 'Waiting for prior transaction to start commit before starting next transaction', the other 15 worker threads are always in 'Waiting for work from SQL thread', and SHOW ENGINE INNODB STATUS shows no locks.
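
      For reference, a minimal sketch of how these thread states can be inspected, assuming the replication workers appear in the processlist as 'system user' (which is how SHOW PROCESSLIST reports them):

      -- Parallel replication setting used on this slave (the slave threads
      -- normally have to be stopped before this variable can be changed):
      SET GLOBAL slave_parallel_threads = 16;

      -- List the replication threads and their current states:
      SELECT ID, COMMAND, TIME, STATE
        FROM information_schema.PROCESSLIST
       WHERE USER = 'system user';
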
      Then I checked the error log and found a server restart, after which replication started in log 'mysql-bin.000160' at position 2140880879:

      150605 18:42:13 [Warning] IP address '192.168.1.117' could not be resolved: Temporary failure in name resolution
      150606 10:32:34 [Note] /home/mariadb/bin/mysqld: Normal shutdown
      
      150606 10:32:34 [Note] Event Scheduler: Purging the queue. 0 events
      150606 10:36:51 [Note] Slave: received end packet from server, apparent master shutdown:
      150606 10:36:51 [Note] Slave I/O thread killed while reading event
      150606 10:36:51 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.000260', position 353423150; GTID position 0-1895939662-1585424
      150606 10:44:24 [Note] InnoDB: Using mutexes to ref count buffer pool pages
      150606 10:44:24 [Note] InnoDB: The InnoDB memory heap is disabled
      150606 10:44:24 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
      150606 10:44:24 [Note] InnoDB: Compressed tables use zlib 1.2.3
      150606 10:44:24 [Note] InnoDB: Using Linux native AIO
      150606 10:44:24 [Note] InnoDB: Using CPU crc32 instructions
      150606 10:44:24 [Note] InnoDB: Initializing buffer pool, size = 78.1G
      150606 10:44:27 [Note] InnoDB: Completed initialization of buffer pool
      150606 10:44:28 [Note] InnoDB: Highest supported file format is Barracuda.
      150606 10:44:28 [Note] InnoDB: The log sequence numbers 70561972035 and 70561972035 in ibdata files do not match the log sequence number 121831407098 in the ib_logfiles!
      150606 10:44:28 [Note] InnoDB: Database was not shutdown normally!
      150606 10:44:28 [Note] InnoDB: Starting crash recovery.
      150606 10:44:28 [Note] InnoDB: Reading tablespace information from the .ibd files...
      150606 10:44:28 [Note] InnoDB: Restoring possible half-written data pages
      150606 10:44:28 [Note] InnoDB: from the doublewrite buffer...
      InnoDB: 1 transaction(s) which must be rolled back or cleaned up
      InnoDB: in total 4554 row operations to undo
      InnoDB: Trx id counter is 28084736
      InnoDB: Last MySQL binlog file position 0 708607516, file name ../binlog/mysql-bin.000179
      150606 10:44:29 [Note] InnoDB: 128 rollback segment(s) are active.
      InnoDB: Starting in background the rollback of uncommitted transactions
      2015-06-06 10:44:29 7f1c2d7ff700  InnoDB: Rolling back trx with id 28084351, 4554 rows to undo
      
      InnoDB: Progress in percents: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100150606 10:44:30 [Note] InnoDB: Rollback of trx with id 28084351 completed
      2015-06-06 10:44:30 7f1c2d7ff700  InnoDB: Rollback of non-prepared transactions completed
      150606 10:44:34 [Note] InnoDB: Waiting for purge to start
      150606 10:44:34 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.17-65.0 started; log sequence number 121831407098
      150606 10:44:34 [Note] Semi-sync replication initialized for transactions.
      150606 10:44:34 [Note] Semi-sync replication enabled on the master.
      150606 10:44:34 server_audit: MariaDB Audit Plugin version 1.1.7 STARTED.
      150606 10:44:34 [Note] Recovering after a crash using ../binlog/mysql-bin
      150606 10:44:34 [Note] Starting crash recovery...
      150606 10:44:34 [Note] Crash recovery finished.
      150606 10:44:34 [Note] Server socket created on IP: '192.168.1.114'.
      150606 10:44:34 [Note] Event Scheduler: Loaded 0 events
      150606 10:44:46 [Note] /home/mariadb/bin/mysqld: Normal shutdown
      
      150606 10:44:50 [Note] /home/mariadb/bin/mysqld: ready for connections.
      Version: '10.0.12-MariaDB-log'  socket: '/home/mariadb/bin/mysql1.sock'  port: 5518  Source distribution
      150606 10:44:50 [Note] Event Scheduler: Purging the queue. 0 events
      150606 10:44:50 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@192.168.1.113:5518' in log 'mysql-bin.000160' at position 2140880879
      150606 10:44:50 [Note] Slave I/O thread killed while connecting to master
      150606 10:44:50 [Note] Slave I/O thread exiting, read up to log 'mysql-bin.000160', position 2140880879; GTID position 0-1895939662-1128069
      150606 10:44:50 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000160' at position 2140880879, relay log '../relaylog/relay-bin.000001' position: 4; GTID position '0-1895939662-1128069'
      150606 10:44:50 [Note] Slave SQL thread exiting, replication stopped in log 'mysql-bin.000160' at position 2140880879; GTID position '0-1895939662-1128069'
      150606 10:44:50 server_audit: STOPPED
      150606 10:44:50 [Note] unregister_replicator OK
      150606 10:44:50 [Note] InnoDB: FTS optimize thread exiting.
      150606 10:44:50 [Note] InnoDB: Starting shutdown...
      150606 10:44:55 [Note] InnoDB: Shutdown completed; log sequence number 121833466167
      150606 10:44:55 [Note] /home/mariadb/bin/mysqld: Shutdown complete
      
      150606 10:45:12 [Note] InnoDB: Using mutexes to ref count buffer pool pages
      150606 10:45:12 [Note] InnoDB: The InnoDB memory heap is disabled
      150606 10:45:12 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
      150606 10:45:12 [Note] InnoDB: Compressed tables use zlib 1.2.3
      150606 10:45:12 [Note] InnoDB: Using Linux native AIO
      150606 10:45:12 [Note] InnoDB: Using CPU crc32 instructions
      150606 10:45:12 [Note] InnoDB: Initializing buffer pool, size = 78.1G
      150606 10:45:15 [Note] InnoDB: Completed initialization of buffer pool
      150606 10:45:15 [Note] InnoDB: Highest supported file format is Barracuda.
      150606 10:45:16 [Note] InnoDB: 128 rollback segment(s) are active.
      150606 10:45:16 [Note] InnoDB: Waiting for purge to start
      150606 10:45:16 [Note] InnoDB:  Percona XtraDB (http://www.percona.com) 5.6.17-65.0 started; log sequence number 121833466167
      150606 10:45:16 [Note] Semi-sync replication initialized for transactions.
      150606 10:45:16 [Note] Semi-sync replication enabled on the master.
      150606 10:45:16 server_audit: MariaDB Audit Plugin version 1.1.7 STARTED.
      150606 10:45:16 [Note] Server socket created on IP: '192.168.1.114'.
      150606 10:45:16 [Note] Event Scheduler: Loaded 0 events
      150606 10:45:16 [Note] /home/mariadb/bin/mysqld: ready for connections.
      Version: '10.0.12-MariaDB-log'  socket: '/home/mariadb/bin/mysql1.sock'  port: 5518  Source distribution
      150606 10:45:16 [Note] Slave I/O thread: Start semi-sync replication to master 'repl@192.168.1.113:5518' in log 'mysql-bin.000160' at position 2140880879
      150606 10:45:16 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000160' at position 2140880879, relay log '../relaylog/relay-bin.000001' position: 4; GTID position '0-1895939662-1128069'
      150606 10:45:16 [Note] Slave I/O thread: connected to master 'repl@192.168.1.113:5518',replication starts at GTID position '0-1895939662-1128069'
      


            Activity

            elenst Elena Stepanova added a comment -

            Hi,
            Thanks for the report.

            10.0.12 is pretty old (about a year old), especially in terms of parallel replication, which has had a lot of fixes since then. At least these two seem to fit, but there might be others:

            MDEV-7326

            The user-visible effect in this hang is that at least one replication worker
            thread is stuck in state "Waiting for prior transaction to start commit
            before starting next transaction", and all other threads are stuck in this
            state or the state "Waiting for prior transaction to commit". (As seen in SHOW
            PROCESSLIST). Killing the worker threads will stop replication, and it can
            then be re-started successfully.

            MDEV-7929

            The parallel replication worker threads can hang in some cases with
            non-transactional event groups. The symptom is that worker threads are stuck
            in "waiting for prior transaction to start commit".

            I recommend that you consider upgrading to the most recent 10.0 version (currently 10.0.19; 10.0.20 will be released soon). Until then, because different failures in this area often look identical, it is going to be very difficult, if not impossible, to tell whether you are hitting an old bug or a new one.

            linzhoukai Zhoukai Lin added a comment - edited

            I think it may be different from MDEV-7326; in that case, killing the worker threads will stop replication, and it can then be re-started successfully. But in my case, I cannot stop replication by executing 'stop slave' or by shutting down the MariaDB server; both get stuck. Only kill -9 <pid> can stop it.
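
            For reference, a sketch of the stop attempt described above, as it could be observed from a second connection (illustrative only; the session split is an assumption, not taken from the report):

            -- Session 1: try to stop replication (hangs in this case)
            STOP SLAVE;

            -- Session 2: confirm the slave threads never stop
            SHOW SLAVE STATUS\G   -- Slave_SQL_Running stays 'Yes'
            SHOW PROCESSLIST;     -- the STOP SLAVE connection remains listed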

            elenst Elena Stepanova added a comment -

            I think what Kristian's comment means is that one needed to kill the 16 parallel threads directly (with KILL <thread id>) rather than via STOP SLAVE. Did you try that?
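
            For illustration, a minimal sketch of that approach; the ID passed to KILL is a placeholder, and the 'system user' filter assumes the workers are listed that way in the processlist:

            -- Find the stuck parallel worker threads:
            SELECT ID, STATE
              FROM information_schema.PROCESSLIST
             WHERE USER = 'system user'
               AND STATE LIKE 'Waiting for prior transaction%';

            -- Kill each returned worker thread individually (placeholder ID):
            KILL 12345;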

            There have also been some fixes related to deadlocks on stopping parallel replication and/or parallel replication threads; moreover, we still have some open ones, e.g. MDEV-7980, MDEV-8039 just to name a couple. So it's possible that you hit two different problems at once.

            How often does it happen for you? Would you consider upgrading one of your servers to see if the problem goes away?

            elenst Elena Stepanova added a comment -

            Zhoukai Lin,
            Are you still experiencing the problem?

            elenst Elena Stepanova added a comment -

            Closing as incomplete for now. Please comment to re-open if you still experience the problem and have new information.


              People

              • Assignee: Unassigned
              • Reporter: linzhoukai Zhoukai Lin
              • Votes: 0
              • Watchers: 3
