Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-7395

change master to Relay_Log_File/log_pos breaks Relay_Master_Log_File:Exec_Master_Log_Pos

    Details

    • Type: Task
    • Status: Stalled
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: 10.0
    • Component/s: Replication
    • Labels:
      None

      Description

      After thinking I was well on the way to recovering a slave:

                        Master_Port: 3306
                      Connect_Retry: 60  
                    Master_Log_File: mysql-bin.006189
                Read_Master_Log_Pos: 50082470  
                     Relay_Log_File: mysqld-relay-bin.001144
                      Relay_Log_Pos: 104857605  
              Relay_Master_Log_File: mysql-bin.006105
                   Slave_IO_Running: Yes  
                  Slave_SQL_Running: Yes
                    Replicate_Do_DB:   
                Replicate_Ignore_DB: 
                 Replicate_Do_Table:   
             Replicate_Ignore_Table: 
            Replicate_Wild_Do_Table:   
        Replicate_Wild_Ignore_Table: 
                         Last_Errno: 0  
                         Last_Error: 
                       Skip_Counter: 0  
                Exec_Master_Log_Pos: 9258445
                    Relay_Log_Space: 8480309601  
                    Until_Condition: None
                     Until_Log_File:   
                      Until_Log_Pos: 0
                 Master_SSL_Allowed: No  
                 Master_SSL_CA_File: 
                 Master_SSL_CA_Path:   
                    Master_SSL_Cert: 
                  Master_SSL_Cipher:   
                     Master_SSL_Key: 
              Seconds_Behind_Master: 963719  
      Master_SSL_Verify_Server_Cert: No
                      Last_IO_Errno: 0  
                      Last_IO_Error: 
                     Last_SQL_Errno: 0  
                     Last_SQL_Error: 
        Replicate_Ignore_Server_Ids:   
                   Master_Server_Id: 12022
                     Master_SSL_Crl:   
                 Master_SSL_Crlpath: 
                         Using_Gtid: No  
                        Gtid_IO_Pos: 
      1 row in set (0.00 sec)
      (interval < 30 sec)
      
      stop slave;
      set global sync_binlog=0, global innodb_flush_log_at_trx_commit=0
      start slave;
      
                       Master_Host: 192.168.179.103
                        Master_User: replication
                        Master_Port: 3306
                      Connect_Retry: 60  
                    Master_Log_File: mysql-bin.006189
                Read_Master_Log_Pos: 54032040
                     Relay_Log_File: mysqld-relay-bin.001144
                      Relay_Log_Pos: 104857605
              Relay_Master_Log_File: mysql-bin.006105
                   Slave_IO_Running: No
                  Slave_SQL_Running: No
                    Replicate_Do_DB: 
                Replicate_Ignore_DB: 
                 Replicate_Do_Table: 
             Replicate_Ignore_Table: 
            Replicate_Wild_Do_Table: 
        Replicate_Wild_Ignore_Table: 
                         Last_Errno: 1594  
                         Last_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
                       Skip_Counter: 0 
                Exec_Master_Log_Pos: 61589175
                    Relay_Log_Space: 8484259756
                    Until_Condition: None
                     Until_Log_File:   
                      Until_Log_Pos: 0
                 Master_SSL_Allowed: No  
                 Master_SSL_CA_File: 
                 Master_SSL_CA_Path:   
                    Master_SSL_Cert: 
                  Master_SSL_Cipher: 
                     Master_SSL_Key: 
              Seconds_Behind_Master: NULL
      Master_SSL_Verify_Server_Cert: No
                      Last_IO_Errno: 0
                      Last_IO_Error:   
                     Last_SQL_Errno: 1594
                     Last_SQL_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
        Replicate_Ignore_Server_Ids: 
                   Master_Server_Id: 12022
                     Master_SSL_Crl: 
                 Master_SSL_Crlpath: 
                         Using_Gtid: No
      
      Looking at the relay file:
      ls -al mysqld-relay-bin.001144
      -rw-rw---- 1 mysql mysql 104857705 Dec 19 03:23 mysqld-relay-bin.001144
      
       mysqlbinlog --start-position 104857605 mysqld-relay-bin.001144 | more
      ERROR: /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
      /*!40019 SET @@session.max_insert_delayed_threads=0*/;
      Error in Log_event::read_log_event(): 'Event too big', data_len: 1163019592, event_type: 48
      /*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
      ERROR: DELIMITER /*!*/;
      Could not read entry at offset 104857605: Error in log format or read error.# at 4
      
      #141215  0:30:57 server id 12025  end_log_pos 248       Start: binlog v 4, server v 10.0.15-MariaDB-1~wheezy-log created 141215  0:30:57
      BINLOG '
      wSuOVA/5LgAA9AAAAPgAAAAAAAQAMTAuMC4xNS1NYXJpYURCLTF+d2...
      AAAAAAAAAAAEEwQAj+72WQ==
      '/*!*/;
      # at 291
      #141218 23:29:46 server id 12022  end_log_pos 248       Start: binlog v 4, server v 10.0.15-MariaDB-1~wheezy-log created 141218 23:29:46
      BINLOG '
      amOTVA/2LgAA9AAAAPgAAAAAAAQAM.....
      AAAAAAAAAAAEEwQA8pOnBg==
      '/*!*/;
      DELIMITER ;
      # End of log file
      ROLLBACK /* added by mysqlbinlog */;
      /*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;
      /*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=0*/;
      
      

      I concluded its 100 bytes from the end of the file, going to reposition to the the next file.

      MariaDB [(none)]> change master to relay_log_file='mysqld-relay-bin.001145',RELAY_LOG_POS=0;

      start slave

      I soon (<2 mins) there after found a duplicate key error on an insert which I suspected was related and quickly clear it. set sql_log_bin=0; REPLACE INTO seller_counters (seller_id, category_id, counter) VALUES (36239, 321, 1); set global sql_slave_skip_counter=1; start slave

      The show slave status soon (<2 mins) after found another error:

      Relay_Log_File: mysqld-relay-bin.001147
                      Relay_Log_Pos: 7901913
              Relay_Master_Log_File: mysql-bin.006189
                   Slave_IO_Running: Yes
                  Slave_SQL_Running: No
      

      (sorry I didn't record the full output at the time)

      So from position:
      Relay_Log_File: mysqld-relay-bin.001144
      Relay_Log_Pos: 104857605
      Relay_Master_Log_File: mysql-bin.006105
      Exec_Master_Log_Pos: 9258445

      until a very short time later after a change master to changed the relay_log_file/pos to the next file and with a 100 bytes changed the new position is:

      Relay_Log_File: mysqld-relay-bin.001147
      Relay_Log_Pos: 7901913
      Relay_Master_Log_File: mysql-bin.006189

      The relay log files are 100M though though probably a logrotate not using shared scripts the following pattern occurs about 3 hourly

      -rw-rw----  1 mysql  mysql  104857912 Dec 21 00:33 mysqld-relay-bin.001187
      -rw-rw----  1 mysql  mysql        569 Dec 21 00:33 mysqld-relay-bin.001188
      -rw-rw----  1 mysql  mysql        341 Dec 21 00:33 mysqld-relay-bin.001189
      

      The master log files are more strict on 100M an only differ on on what appears to be a daily logrotate.

      The relay log file has gone from 001144 to 001147 in the same time Relay_Master_Log_File has gone from 006105 (19th December) to mysql-bin.006189 (30 December, which is the current binlog on the master).

      binlog_format | STATEMENT

      A few hours later I decided to capture this right without gaps.

      Take a stopped slave position and ignoring this error (i'll put a new bug for it).

                                                          • 1. row ***************************
                                                            Slave_IO_State:
                                                            Master_Host: 192.168.179.103
                                                            Master_User: replication
                                                            Master_Port: 3306
                                                            Connect_Retry: 60
                                                            Master_Log_File: mysql-bin.006189
                                                            Read_Master_Log_Pos: 96248700
                                                            Relay_Log_File: mysqld-relay-bin.001150
                                                            Relay_Log_Pos: 94786159
                                                            Relay_Master_Log_File: mysql-bin.006189
                                                            Slave_IO_Running: No
                                                            Slave_SQL_Running: No
                                                            Replicate_Do_DB:
                                                            Replicate_Ignore_DB:
                                                            Replicate_Do_Table:
                                                            Replicate_Ignore_Table:
                                                            Replicate_Wild_Do_Table:
                                                            Replicate_Wild_Ignore_Table:
                                                            Last_Errno: 1594
                                                            Last_Error: Relay log read failure: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbin
                                                            log' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you
                                                            want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave.
                                                            Skip_Counter: 0
                                                            Exec_Master_Log_Pos: 94785872
                                                            Relay_Log_Space: 8316760144

      change master to Relay_Log_File='mysqld-relay-bin.001150',relay_log_pos=94785831; show slave status;

      Slave_IO_State:
      Master_Host: 192.168.179.103
      Master_User: replication
      Master_Port: 3306
      Connect_Retry: 60
      Master_Log_File: mysql-bin.006189
      Read_Master_Log_Pos: 96248700
      Relay_Log_File: mysqld-relay-bin.001150
      Relay_Log_Pos: 94785831
      Relay_Master_Log_File: mysql-bin.006189
      Slave_IO_Running: No
      Slave_SQL_Running: No
      Replicate_Do_DB:
      Replicate_Ignore_DB:
      Replicate_Do_Table:
      Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
      Replicate_Wild_Ignore_Table:
      Last_Errno: 0
      Last_Error:
      Skip_Counter: 0
      Exec_Master_Log_Pos: 96248700
      Relay_Log_Space: 8316760144
      Until_Condition: None

      so changing Relay_Log_Pos: 94785831
      to Relay_Log_Pos: 94786159

      changed

      Exec_Master_Log_Pos: 94785872
      Exec_Master_Log_Pos: 96248700

      so changing the relay_log_pos back 300 bytes made the exec_master_log_pos jump forward 1.5M.

      Looks like this is grabbing a current master position when then relay_log_pos/file is changed.

        Gliffy Diagrams

          Attachments

            Issue Links

              Activity

              Hide
              knielsen Kristian Nielsen added a comment -

              This sounds a lot like MDEV-7237, which is fixed in 10.0.16.

              But if 10.0.16 (when it is released) does not fix it, then please feel free to re-open this bug with updated info.

              Show
              knielsen Kristian Nielsen added a comment - This sounds a lot like MDEV-7237 , which is fixed in 10.0.16. But if 10.0.16 (when it is released) does not fix it, then please feel free to re-open this bug with updated info.
              Hide
              danblack Daniel Black added a comment - - edited

              This sounds a lot like MDEV-7237, which is fixed in 10.0.16.

              Doesn't to me. The Relay_Master_log_file:Exec_master_log_pos became corrupted in the absence of a running sql or io thread.

              Show
              danblack Daniel Black added a comment - - edited This sounds a lot like MDEV-7237 , which is fixed in 10.0.16. Doesn't to me. The Relay_Master_log_ file:Exec_master_log_pos became corrupted in the absence of a running sql or io thread.
              Hide
              danblack Daniel Black added a comment -

              also in the second example the position changed and not the file.

              Show
              danblack Daniel Black added a comment - also in the second example the position changed and not the file.
              Hide
              arjen Arjen Lentz added a comment -

              Kristian - this doesn't appear to be fixed (as per Daniel testing) and thus this would not be a duplicate of another fixed issue.

              Show
              arjen Arjen Lentz added a comment - Kristian - this doesn't appear to be fixed (as per Daniel testing) and thus this would not be a duplicate of another fixed issue.
              Hide
              knielsen Kristian Nielsen added a comment -

              > Kristian - this doesn't appear to be fixed (as per Daniel testing) and thus
              > this would not be a duplicate of another fixed issue.

              Arjen, so what do you expect me to do with this issue?

              • Kristian.
              Show
              knielsen Kristian Nielsen added a comment - > Kristian - this doesn't appear to be fixed (as per Daniel testing) and thus > this would not be a duplicate of another fixed issue. Arjen, so what do you expect me to do with this issue? Kristian.
              Hide
              danblack Daniel Black added a comment -

              upstream bug with test case thanks to Elena Stepanova

              Show
              danblack Daniel Black added a comment - upstream bug with test case thanks to Elena Stepanova
              Hide
              arjen Arjen Lentz added a comment -

              Hi Kristian

              Well, if it's not fixed then it can't be closed - doesn't mean I know exactly where we should go next, but as a baseline it seems like the right direction. That way others would also be able to add additional information.
              My thoughts at this stage:

              • Does the upstream bug provide any lead on where to look?
              • Also Daniel's earlier comment on that particular variable getting corrupted might be something that can be tracked in the source, as I presume it doesn't get written to in that many places.
              • If it's corruption in a file (can't discern that for sure from Dan's comment), there might be a logic error when writing that file (double linefeed, that kind of nonsense) in certain cases.

              Thanks
              Arjen.

              Show
              arjen Arjen Lentz added a comment - Hi Kristian Well, if it's not fixed then it can't be closed - doesn't mean I know exactly where we should go next, but as a baseline it seems like the right direction. That way others would also be able to add additional information. My thoughts at this stage: Does the upstream bug provide any lead on where to look? Also Daniel's earlier comment on that particular variable getting corrupted might be something that can be tracked in the source, as I presume it doesn't get written to in that many places. If it's corruption in a file (can't discern that for sure from Dan's comment), there might be a logic error when writing that file (double linefeed, that kind of nonsense) in certain cases. Thanks Arjen.
              Hide
              danblack Daniel Black added a comment -

              I think the upstream bug is an exact duplicate.

              The cause of the corrupted position is hopefully fixed as per the other bugs.

              I see trying to map any relay log file/pos changes back to the master log files/pos as quite a hard problem I assume if there is no easy way.

              Invalidating the display of the relay_master_log_file:exec_master_log_pos until some IO/SQL thread convergence occurs may be a solution.

              Could just documented as don't rely on the relay_master_log_file:pos after changing the local relay file/pos.

              I have to admit that this might be the first time I've actually changed relay_log_file/pos, and i only tried to do it because of another error corrupting the positions, and expired master binlogs making a resync there impossible. So I doubt its something commonly done.

              Show
              danblack Daniel Black added a comment - I think the upstream bug is an exact duplicate. The cause of the corrupted position is hopefully fixed as per the other bugs. I see trying to map any relay log file/pos changes back to the master log files/pos as quite a hard problem I assume if there is no easy way. Invalidating the display of the relay_master_log_ file:exec_master_log_pos until some IO/SQL thread convergence occurs may be a solution. Could just documented as don't rely on the relay_master_log_ file:pos after changing the local relay file/pos. I have to admit that this might be the first time I've actually changed relay_log_file/pos, and i only tried to do it because of another error corrupting the positions, and expired master binlogs making a resync there impossible. So I doubt its something commonly done.

                People

                • Assignee:
                  knielsen Kristian Nielsen
                  Reporter:
                  danblack Daniel Black
                • Votes:
                  0 Vote for this issue
                  Watchers:
                  4 Start watching this issue

                  Dates

                  • Created:
                    Updated: