Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-7323

Replica with GTID stopping replication on some transactions when no gtid replica didn't stop

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 10.0.14
    • Fix Version/s: N/A
    • Component/s: Replication
    • Labels:
      None
    • Environment:
      Redhat Linux 2.6.32-431.29.2.el6.x86_64
      VM on Citrix Xen 16GB RAM 8vcpu

      Description

      We have had replication stop on our GTID enabled slave when the same events are passing through
      Things such as this have failed and skip has been executed to allow replication to continue

      Last_Errno: 1396
      Last_Error: Error 'Operation CREATE USER failed for 'user1'@'192.168.x.xx'' on query. Default database: 'information_schema'. Query: 'CREATE USER ''user1'@'192.168.x.xx''
      Actual user details changed for security reasons

      Drop user has stalled and a error from what I believe to be a stored procedure which inserts some records into a table has caused issues.

        Gliffy Diagrams

          Attachments

            Activity

            Hide
            knielsen Kristian Nielsen added a comment - - edited

            We need some more details to understand what the problem here is.

            1. You mention "slave when the same events are passing through". Do you mean that you have a setup S1->S2->S3, where S2 is using GTID but S3 is not? Or something else? What makes you think the problem is related to GTID?

            2. Might there be some reasonable explanation for the CREATE USER statement to fail, like the user existing before or something like that? What is in the error log? What happens if the CREATE USER statement is run manually on the server?

            3. With respect to the "Drop user has stalled and ..." - we really need a detailed, precise description of each problem to be able to say anything meaningful.

            Show
            knielsen Kristian Nielsen added a comment - - edited We need some more details to understand what the problem here is. 1. You mention "slave when the same events are passing through". Do you mean that you have a setup S1->S2->S3, where S2 is using GTID but S3 is not? Or something else? What makes you think the problem is related to GTID? 2. Might there be some reasonable explanation for the CREATE USER statement to fail, like the user existing before or something like that? What is in the error log? What happens if the CREATE USER statement is run manually on the server? 3. With respect to the "Drop user has stalled and ..." - we really need a detailed, precise description of each problem to be able to say anything meaningful.
            Hide
            Meerkat63 Peter McLarty added a comment - - edited

            I have

             master -> slave 1 
                    -> slave 2
            

            My thinking around gtid is that the problems are only occurring on the gtid enabled slave

            MariaDB [(none)]> show slave status\G
            *************************** 1. row ***************************
                           Slave_IO_State: Queueing master event to the relay log
                              Master_Host: 192.168.x.xx
                              Master_User: repl
                              Master_Port: 3306
                            Connect_Retry: 60
                          Master_Log_File: mysql-bin.000010
                      Read_Master_Log_Pos: 442468943
                           Relay_Log_File: mysql-relay-bin.000002
                            Relay_Log_Pos: 1967113
                    Relay_Master_Log_File: mysql-bin.000010
                         Slave_IO_Running: Yes
                        Slave_SQL_Running: Yes
                          Replicate_Do_DB:
                      Replicate_Ignore_DB:
                       Replicate_Do_Table:
                   Replicate_Ignore_Table:
                  Replicate_Wild_Do_Table:
              Replicate_Wild_Ignore_Table:
                               Last_Errno: 0
                               Last_Error:
                             Skip_Counter: 0
                      Exec_Master_Log_Pos: 320379884
                          Relay_Log_Space: 124056469
                          Until_Condition: None
                           Until_Log_File:
                            Until_Log_Pos: 0
                       Master_SSL_Allowed: No
                       Master_SSL_CA_File:
                       Master_SSL_CA_Path:
                          Master_SSL_Cert:
                        Master_SSL_Cipher:
                           Master_SSL_Key:
                    Seconds_Behind_Master: 25425
            Master_SSL_Verify_Server_Cert: No
                            Last_IO_Errno: 0
                            Last_IO_Error:
                           Last_SQL_Errno: 0
                           Last_SQL_Error:
              Replicate_Ignore_Server_Ids:
                         Master_Server_Id: 148
                           Master_SSL_Crl:
                       Master_SSL_Crlpath:
                               Using_Gtid: Slave_Pos
                              Gtid_IO_Pos: 0-148-5430559
            1 row in set (0.00 sec)
            

            I cannot access ftp server at present from internal network will upload the error log later this evening. A number of errors which have stopped the slave replication in that log

            Show
            Meerkat63 Peter McLarty added a comment - - edited I have master -> slave 1 -> slave 2 My thinking around gtid is that the problems are only occurring on the gtid enabled slave MariaDB [(none)]> show slave status\G *************************** 1. row *************************** Slave_IO_State: Queueing master event to the relay log Master_Host: 192.168.x.xx Master_User: repl Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysql-bin.000010 Read_Master_Log_Pos: 442468943 Relay_Log_File: mysql-relay-bin.000002 Relay_Log_Pos: 1967113 Relay_Master_Log_File: mysql-bin.000010 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 320379884 Relay_Log_Space: 124056469 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: No Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 25425 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 148 Master_SSL_Crl: Master_SSL_Crlpath: Using_Gtid: Slave_Pos Gtid_IO_Pos: 0-148-5430559 1 row in set (0.00 sec) I cannot access ftp server at present from internal network will upload the error log later this evening. A number of errors which have stopped the slave replication in that log
            Hide
            Meerkat63 Peter McLarty added a comment -

            Uploaded the error log to ftp.askmonty.org. Let me know what else to investigate and what tools and I will endeavour to find the information

            Show
            Meerkat63 Peter McLarty added a comment - Uploaded the error log to ftp.askmonty.org. Let me know what else to investigate and what tools and I will endeavour to find the information
            Hide
            knielsen Kristian Nielsen added a comment -

            Thank you for uploading the error log. I took a quick look at it.

            The error log contains a lot of replication errors, like "Can't find record in
            ..." errors, and messages that such errors were skipped over using
            @@sql_slave_skip_counter. This suggests that the slave is diverged from the
            master (the slave does not have the same data as the master had at the
            corresponding replication position).

            A diverged slave can in general lead to various replication failures, if some
            query or operation executed on the slave sees different data from what was
            originally on the master. It seems likely that this could also be the cause of
            the failure of CREATE USER, though I cannot know this for sure one way or the
            other from the data available.

            If the problem can be reproduced on a slave that has been re-provisioned to be
            identical to the master, then that would make it easier to say more.

            Or alternatively, the actual problem needs to be narrowed down more
            clearly. For example, if the CREATE USER statement fails in replication, but
            then the exact same statement executed manually on the slave server succeeds,
            then that would be something that could be investigated further. Or if it can
            be shown that GTID replication somehow tries to replicate the wrong statement.

            From the information given, the most likely explanation is just that the slave
            data is sufficiently different from the master data to cause the errors
            mentioned.

            Show
            knielsen Kristian Nielsen added a comment - Thank you for uploading the error log. I took a quick look at it. The error log contains a lot of replication errors, like "Can't find record in ..." errors, and messages that such errors were skipped over using @@sql_slave_skip_counter. This suggests that the slave is diverged from the master (the slave does not have the same data as the master had at the corresponding replication position). A diverged slave can in general lead to various replication failures, if some query or operation executed on the slave sees different data from what was originally on the master. It seems likely that this could also be the cause of the failure of CREATE USER, though I cannot know this for sure one way or the other from the data available. If the problem can be reproduced on a slave that has been re-provisioned to be identical to the master, then that would make it easier to say more. Or alternatively, the actual problem needs to be narrowed down more clearly. For example, if the CREATE USER statement fails in replication, but then the exact same statement executed manually on the slave server succeeds, then that would be something that could be investigated further. Or if it can be shown that GTID replication somehow tries to replicate the wrong statement. From the information given, the most likely explanation is just that the slave data is sufficiently different from the master data to cause the errors mentioned.
            Hide
            knielsen Kristian Nielsen added a comment -

            Without further information, I will have to assume that this is caused by the slave being out of sync with the master. So closing ...

            Show
            knielsen Kristian Nielsen added a comment - Without further information, I will have to assume that this is caused by the slave being out of sync with the master. So closing ...
            Hide
            Meerkat63 Peter McLarty added a comment -

            Thanks guys it was a replication issue pt-table-sync was not able to repair and we have rebuilt the slave

            Show
            Meerkat63 Peter McLarty added a comment - Thanks guys it was a replication issue pt-table-sync was not able to repair and we have rebuilt the slave

              People

              • Assignee:
                knielsen Kristian Nielsen
                Reporter:
                Meerkat63 Peter McLarty
              • Votes:
                0 Vote for this issue
                Watchers:
                4 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: