MariaDB Server / MDEV-4329

CHANGE MASTER ... master_gtid_pos='' does not reset the position

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      I'm trying to tweak the test case initially described in MDEV-4325 to make it work. As discussed in the comments (https://mariadb.atlassian.net/browse/MDEV-4325?focusedCommentId=30821&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-30821), I'm now setting master_gtid_pos='' after slave reset. It still does not seem to work:

      [connection master]
      RESET MASTER;
      include/stop_slave.inc
      RESET SLAVE ALL;
      CHANGE MASTER TO master_host='127.0.0.1', master_port=16000, master_user='root', master_gtid_pos=auto;
      include/start_slave.inc
      CREATE TABLE t1 (i INT);
      include/stop_slave.inc
      DROP TABLE t1;
      RESET SLAVE;
      CHANGE MASTER TO master_gtid_pos='';
      ####################################################
      # We have set master_gtid_pos to '', so it's 
      # expected to be empty now (and it is)
      ####################################################
      SHOW ALL SLAVES STATUS;
      Connection_name	
      Slave_SQL_State	
      Slave_IO_State	
      Master_Host	127.0.0.1
      Master_User	root
      Master_Port	16000
      Connect_Retry	1
      Master_Log_File	
      Read_Master_Log_Pos	0
      Relay_Log_File	slave-relay-bin.000001
      Relay_Log_Pos	4
      Relay_Master_Log_File	
      Slave_IO_Running	No
      Slave_SQL_Running	No
      Replicate_Do_DB	
      Replicate_Ignore_DB	
      Replicate_Do_Table	
      Replicate_Ignore_Table	
      Replicate_Wild_Do_Table	
      Replicate_Wild_Ignore_Table	
      Last_Errno	0
      Last_Error	
      Skip_Counter	0
      Exec_Master_Log_Pos	0
      Relay_Log_Space	248
      Until_Condition	None
      Until_Log_File	
      Until_Log_Pos	0
      Master_SSL_Allowed	No
      Master_SSL_CA_File	
      Master_SSL_CA_Path	
      Master_SSL_Cert	
      Master_SSL_Cipher	
      Master_SSL_Key	
      Seconds_Behind_Master	NULL
      Master_SSL_Verify_Server_Cert	No
      Last_IO_Errno	0
      Last_IO_Error	
      Last_SQL_Errno	0
      Last_SQL_Error	
      Replicate_Ignore_Server_Ids	
      Master_Server_Id	1
      Using_Gtid	1
      Retried_transactions	0
      Max_relay_log_size	1073741824
      Executed_log_entries	16
      Slave_received_heartbeats	0
      Slave_heartbeat_period	60.000
      Gtid_Pos	
      ####################################################
      # But it still claims we are using an invalid value 
      ####################################################
      include/start_slave.inc
      SHOW SLAVE STATUS;
      Slave_IO_State	
      Master_Host	127.0.0.1
      Master_User	root
      Master_Port	16000
      Connect_Retry	1
      Master_Log_File	
      Read_Master_Log_Pos	0
      Relay_Log_File	slave-relay-bin.000001
      Relay_Log_Pos	4
      Relay_Master_Log_File	
      Slave_IO_Running	No
      Slave_SQL_Running	Yes
      Replicate_Do_DB	
      Replicate_Ignore_DB	
      Replicate_Do_Table	
      Replicate_Ignore_Table	
      Replicate_Wild_Do_Table	
      Replicate_Wild_Ignore_Table	
      Last_Errno	0
      Last_Error	
      Skip_Counter	0
      Exec_Master_Log_Pos	0
      Relay_Log_Space	248
      Until_Condition	None
      Until_Log_File	
      Until_Log_Pos	0
      Master_SSL_Allowed	No
      Master_SSL_CA_File	
      Master_SSL_CA_Path	
      Master_SSL_Cert	
      Master_SSL_Cipher	
      Master_SSL_Key	
      Seconds_Behind_Master	NULL
      Master_SSL_Verify_Server_Cert	No
      Last_IO_Errno	1236
      Last_IO_Error	Got fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-2-2, which is not in the master's binlog'
      Last_SQL_Errno	0
      Last_SQL_Error	
      Replicate_Ignore_Server_Ids	
      Master_Server_Id	1
      Using_Gtid	1
      

      Test case:

      
      --source include/master-slave.inc
      --source include/have_xtradb.inc
      --source include/have_binlog_format_mixed.inc
      
      RESET MASTER;
      
      --connection slave
      --source include/stop_slave.inc
      RESET SLAVE ALL;
      eval CHANGE MASTER TO master_host='127.0.0.1', master_port=$MASTER_MYPORT, 
           master_user='root', master_gtid_pos=auto;
      --source include/start_slave.inc
      
      --connection master
      CREATE TABLE t1 (i INT);
      --save_master_pos
      
      --sync_slave_with_master
      --source include/stop_slave.inc
      DROP TABLE t1;
      RESET SLAVE;
      # We can optionally delete the contents of the table,
      # it doesn't help anyway
      # DELETE FROM mysql.rpl_slave_state;
      eval CHANGE MASTER TO master_gtid_pos='';
      
      --echo ####################################################
      --echo # We have set master_gtid_pos to '', so it's 
      --echo # expected to be empty now (and it is)
      --echo ####################################################
      query_vertical SHOW ALL SLAVES STATUS;
      
      --echo ####################################################
      --echo # But it still claims we are using an invalid value 
      --echo ####################################################
      
      --source include/start_slave.inc
      --sleep 1
      query_vertical SHOW SLAVE STATUS;
      
      revision-id: knielsen@knielsen-hq.org-20130322102628-hxohewmbfyd1wig6
      revno: 3538
      branch-nick: 10.0-mdev26
      


              Activity

              elenst Elena Stepanova added a comment -

              Please also note that the expected position looks corrupted: how did it suddenly become 0-2-.. ? If anything, it should still be 0-1-...

              knielsen Kristian Nielsen added a comment -

              It's the same problem again; I need to get this fixed properly.

              So the issue here is that the slave has GTID 0-2-2 in its binlog (the DROP
              TABLE t1). When the slave connects, it sees that the binlog has something
              newer, and appends it to the slave state. If one adds RESET MASTER on the
              slave, it works.
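
The merge described above can be modelled with a minimal Python sketch. This is not the server's code: the names `Gtid`, `parse_pos` and `effective_start_pos` are hypothetical, and the rule "the higher sequence number in a domain wins" is an assumption made to illustrate how an empty requested position can still turn into 0-2-2.

```python
from typing import Dict, NamedTuple


class Gtid(NamedTuple):
    """A GTID written as domain_id-server_id-seq_no, e.g. 0-2-2."""
    domain_id: int
    server_id: int
    seq_no: int

    @classmethod
    def parse(cls, text: str) -> "Gtid":
        domain, server, seq = (int(part) for part in text.split("-"))
        return cls(domain, server, seq)

    def __str__(self) -> str:
        return f"{self.domain_id}-{self.server_id}-{self.seq_no}"


def parse_pos(text: str) -> Dict[int, Gtid]:
    """Parse a comma-separated GTID position into {domain_id: Gtid}."""
    pos = {}
    for item in text.split(","):
        item = item.strip()
        if item:
            gtid = Gtid.parse(item)
            pos[gtid.domain_id] = gtid
    return pos


def effective_start_pos(slave_pos: Dict[int, Gtid],
                        own_binlog: Dict[int, Gtid]) -> Dict[int, Gtid]:
    """Assumed rule: a newer GTID in the slave's own binlog overrides
    (or fills in) the slave position for that domain."""
    merged = dict(slave_pos)
    for domain_id, gtid in own_binlog.items():
        current = merged.get(domain_id)
        if current is None or gtid.seq_no > current.seq_no:
            merged[domain_id] = gtid
    return merged


# CHANGE MASTER TO master_gtid_pos='' leaves the requested position empty,
# but the DROP TABLE t1 run on the slave left 0-2-2 in the slave's binlog.
start = effective_start_pos(parse_pos(""), parse_pos("0-2-2"))
print(",".join(str(gtid) for gtid in start.values()))  # prints 0-2-2
```

With an empty slave binlog (the state after RESET MASTER on the slave), the same call returns an empty position, which matches the observation that adding RESET MASTER makes the test work.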

              But this is highly unacceptable behaviour, of course. I thought I implemented
              something such that the CHANGE MASTER MASTER_GTID_POS="" gives an error in
              this case, suggesting the RESET MASTER. I will see if I can get this to work
              properly in your case.

              But I'm starting to think the root problem is deeper. There are two different
              situations that conflict here. One is when a master is changed to a slave, and
              I want it to automagically resume from the position in its binlog. The other
              is when the user explicitly sets a start position manually, which should not
              be overridden by the binlog, of course.

              I'm wondering if I'm trying to make things too magic. Maybe it would be better
              if I never automatically use the binlog to determine where slave
              starts.

              Instead, if the user wants to turn an old master into a new slave, they can
              explicitly do SHOW MASTER STATUS (when that is implemented) to get the binlog
              state and use that for CHANGE MASTER TO MASTER_GTID_POS='xxx'. Or I could
              implement a special MASTER_GTID_POS=MASTER_STATE.

              Or maybe I can fix it so they get an error instead of surprising behaviour.

              Let's discuss this on IRC or something, I really want to get this working
              properly!

              elenst Elena Stepanova added a comment - - edited

              >> So the issue here is that the slave has GTID 0-2-2 in its binlog (the DROP
              >> TABLE t1). When the slave connects, it sees that the binlog has something
              >> newer, and appends it to the slave state. If one adds RESET MASTER on the
              >> slave, it works.

              Okay, now I understand where 0-2-2 comes from, but the error message itself is highly confusing.
              'Error: connecting slave requested to start from GTID 0-2-2, which is not in the master's binlog'
              First, we didn't request the slave to start from GTID 0-2-2; secondly, of course it's not in the master's binlog – why would it be, the master has 0-1-...
              We need to re-word it somehow.

              >> But this is highly unacceptable behaviour, of course. I thought I implemented
              >> something such that the CHANGE MASTER MASTER_GTID_POS="" gives an error in
              >> this case, suggesting the RESET MASTER. I will see if I can get this to work
              >> properly in your case.

              But in this case, I do NOT want to do RESET MASTER! On the contrary, I want to replay the existing master binlog from the beginning, which is why I drop table t1 (so that it doesn't cause an error when slave attempts to execute the create table event).
              I tried to describe the scenarios that I had in mind in more detail (maybe in excessive detail) in MDEV-4325.

              >> The other
              >> is user explicitly setting manually a start position, which should not be
              >> overridden by the binlog, of course.

              That's right, in this particular case I expected my explicit setting to work rather than be overridden by auto magic; especially since, as it was discussed before, it's the only way to actually reset the GTID position.

              >> I'm wondering if I'm trying to make things too magic. Maybe it would be better
              >> if I never automatically use the binlog to determine where slave
              >> starts.

              'auto' mode is still a mystery to me, so I don't have a strong opinion yet.

              >> Let's discuss this on IRC or something, I really want to get this working
              >> properly!

              Yep, let's. At this point I'm especially interested in figuring out the difference between the three cases:
              1) we use old-fashioned way to configure the slave (master_log_pos/master_log_file);
              2) we use an explicit value of GTID position to start replication;
              3) we use master_gtid_pos=auto

              How are these three cases supposed to differ, and what are the expected limitations of (1) compared to (2) and of (2) compared to (3), etc.?

              knielsen Kristian Nielsen added a comment -

              > But in this case, I do NOT want to do RESET MASTER! On the contrary, I want
              > to replay the existing master binlog from the beginning, which is why I drop
              > table t1 (so that it doesn't cause an error when slave attempts to execute
              > the create table event).

              Yes, I understand. You need to RESET MASTER on the slave, not on the master.
              In fact, RESET MASTER on the slave is a good idea anyway. Without it, you
              would get duplicate events in the binlog on the slave, which would cause
              trouble if you were to use the slave as a master for a third server.

              A fundamental concept for MariaDB GTID is that binlog order must be identical
              across all servers (hence the "global"). When using multiple domains, the
              order must be identical only within each domain.
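
The ordering rule can be illustrated with a small Python sketch. The helper names `per_domain` and `same_order` are hypothetical, binlogs are reduced to lists of GTID strings, and treating a shorter binlog as consistent when it is a prefix of the longer one is an assumption (a server that has simply not caught up yet), not something stated in this report.

```python
from collections import defaultdict
from typing import Dict, List


def per_domain(binlog: List[str]) -> Dict[int, List[str]]:
    """Split a binlog (list of 'domain-server-seq' strings) into one
    ordered stream per replication domain."""
    streams = defaultdict(list)
    for gtid in binlog:
        domain_id = int(gtid.split("-")[0])
        streams[domain_id].append(gtid)
    return dict(streams)


def same_order(binlog_a: List[str], binlog_b: List[str]) -> bool:
    """True if, within every shared domain, the two binlogs agree on
    the GTID sequence as far as both of them go."""
    a, b = per_domain(binlog_a), per_domain(binlog_b)
    for domain_id in a.keys() & b.keys():
        common = min(len(a[domain_id]), len(b[domain_id]))
        if a[domain_id][:common] != b[domain_id][:common]:
            return False
    return True


# Interleaving between domains may differ; only per-domain order matters.
print(same_order(["0-1-1", "1-1-1", "0-1-2"],
                 ["1-1-1", "0-1-1", "0-1-2"]))   # True

# A local write on the slave (0-2-2) diverges from the master's 0-1-2.
print(same_order(["0-1-1", "0-1-2"],
                 ["0-1-1", "0-2-2"]))            # False
```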

              I think it is getting to the point where I should use your feedback so far and
              write up some proper documentation. This will force me to think the whole
              thing through properly, and once written will allow you to work without
              fumbling too much in the dark.

              I worry that I still have so many gotchas in the user interface after several
              iterations, but hopefully we can find some way to make it work reasonably.

              knielsen Kristian Nielsen added a comment -

              Ok, so it turns out I made a simple mistake in the code; it is fixed now.
              Now the testcase gets this error message:

              mysqltest: At line 26: query 'CHANGE MASTER TO master_gtid_pos=''' failed: 1947: Requested MASTER_GTID_POS contains no value for replication domain 0. This conflicts with the binary log which contains GTID 0-2-2. To use the requested MASTER_GTID_POS, the old binlog must be removed with RESET MASTER to avoid out-of-order binlog

              So this is the new testcase:

              --connection master
              CREATE TABLE t1 (i INT);
              --sync_slave_with_master
              --source include/stop_slave.inc
              DROP TABLE t1;
              RESET SLAVE;
              --error ER_MASTER_GTID_POS_MISSING_DOMAIN
              eval CHANGE MASTER TO master_gtid_pos='';
              RESET MASTER;
              eval CHANGE MASTER TO master_gtid_pos='';
              --source include/start_slave.inc
              --sleep 1
              query_vertical SHOW ALL SLAVES STATUS;
              SELECT * FROM t1;
              

              So I've pushed this fix. However, I'm still open to discussing the deeper
              issue of whether this is the best way to handle things.

              It was a fundamental design decision I made early that I wanted the slave GTID
              state to be just a position in the binlog (or one per replication domain) -
              not a set of all applied GTIDs, like in the MySQL 5.6 design. This makes
              things simpler for the user, but it also gives the user a great
              responsibility: to ensure that binlogs are identical on all servers that can
              at some point become a master.

              This is because GTID promises that any server can be made a slave of any
              other server with just MASTER_GTID_POS=AUTO, and the only thing the slave
              knows is the single GTID to start at. So starting from this GTID has to
              return the exact same sequence of events, no matter which master server is
              selected. Otherwise inconsistent/incorrect replication will result.

              So the lesson I took from your previous extensive feedback was to try much
              harder to protect the user from mistakes with inconsistent binlogs and
              configurations, and to give errors in many more cases, like this one.
              Unfortunately I missed the case MASTER_GTID_POS='', but fortunately you
              caught it immediately.
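
As a rough Python model of the new check (hypothetical names; only the missing-domain case from the error message above is covered, and the real server-side validation may do more), the requested position is compared against the domains present in the slave's own binlog:

```python
from typing import Dict


class MissingDomainError(ValueError):
    """Stand-in for ER_MASTER_GTID_POS_MISSING_DOMAIN in this sketch."""


def parse_pos(text: str) -> Dict[int, str]:
    """Parse 'd-s-n,d-s-n,...' into {domain_id: gtid_string}."""
    pos = {}
    for item in text.split(","):
        item = item.strip()
        if item:
            pos[int(item.split("-")[0])] = item
    return pos


def check_requested_pos(requested: str, binlog_state: str) -> None:
    """Reject a requested MASTER_GTID_POS that has no value for a
    replication domain that the local binlog already contains."""
    wanted = parse_pos(requested)
    for domain_id, gtid in parse_pos(binlog_state).items():
        if domain_id not in wanted:
            raise MissingDomainError(
                "Requested MASTER_GTID_POS contains no value for "
                f"replication domain {domain_id}. This conflicts with "
                f"the binary log which contains GTID {gtid}."
            )


# Before RESET MASTER on the slave: the local binlog still holds 0-2-2.
try:
    check_requested_pos("", "0-2-2")
except MissingDomainError as err:
    print(err)

# After RESET MASTER on the slave the binlog is empty, so '' is accepted.
check_requested_pos("", "")
```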

              Basically, with GTID, you can no longer do local changes on the slave without
              thinking about what goes into the slave binlog, because only the current
              master within a domain is allowed to write to the binlog. One needs to do such
              local changes with SQL_LOG_BIN=0 if they are not meant to be replicated
              elsewhere, or clean them up afterwards with RESET MASTER (on the slave).

              There is definitely still a need for improvement with this and the user
              interface in general. So I suggest we continue the discussion in the context
              of your excellent analysis in MDEV-4325.


                People

                • Assignee: knielsen Kristian Nielsen
                • Reporter: elenst Elena Stepanova
                • Votes: 0
                • Watchers: 1

                    Time Tracking

                    • Original Estimate: Not Specified
                    • Remaining Estimate: 0 minutes
                    • Time Spent: 2 hours