MariaDB Server / MDEV-26 Global transaction ID / MDEV-4325

Relation between GTID_POS and RESET SLAVE [ALL] / CHANGE MASTER TO

    Details

    • Type: Technical task
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      The provided test case

      • starts master=>slave replication from scratch, using gtid_pos=auto;
      • executes 3 events on master;
      • waits till slave synchronizes with master;
      • stops replication;
      • resets slave and master;
      • executes a few events on master;
      • starts master=>slave replication from scratch, using gtid_pos=auto

      The slave attempts to start from the 4th event. Depending on the nature of the events and the exact number of the "few" events in the second round, this results in a replication failure, in "fatal error 1236 from master when reading data from binary log: 'Error: connecting slave requested to start from GTID 0-1-3, which is not in the master's binlog'", or in the first events in the new master binlog being silently ignored.
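The outcomes above can be illustrated with a toy model of how a master resolves the position requested by a connecting slave. This is a minimal sketch in Python, not the server's implementation: `events_to_send`, `MissingGtidError`, and the list-of-pairs binlog are hypothetical names invented for illustration, and only a single replication domain is modeled.

```python
class MissingGtidError(Exception):
    """Stands in for error 1236: the requested GTID is not in the master's binlog."""


def events_to_send(master_binlog, slave_gtid_pos):
    """Return the events a toy master sends to a slave connecting at slave_gtid_pos.

    master_binlog is a list of (gtid, statement) pairs in binlog order.
    slave_gtid_pos is the GTID of the last event the slave applied,
    or '' to start from the beginning.
    """
    if slave_gtid_pos == "":
        return list(master_binlog)
    gtids = [gtid for gtid, _ in master_binlog]
    if slave_gtid_pos not in gtids:
        raise MissingGtidError(
            f"slave requested to start from GTID {slave_gtid_pos}, "
            "which is not in the master's binlog"
        )
    # Everything up to and including the requested GTID is assumed applied.
    return master_binlog[gtids.index(slave_gtid_pos) + 1:]


# The slave kept position 0-1-3 from the first round; RESET MASTER restarted
# GTID numbering on the master, so the new events reuse 0-1-1, 0-1-2, ...
stale_pos = "0-1-3"
short_binlog = [("0-1-1", "CREATE TABLE t1 ..."), ("0-1-2", "INSERT ...")]
long_binlog = short_binlog + [("0-1-3", "INSERT ..."), ("0-1-4", "INSERT ...")]

# Enough new events: the first three are silently skipped.
skipped_start = events_to_send(long_binlog, stale_pos)

# Too few new events: the connection fails instead.
try:
    events_to_send(short_binlog, stale_pos)
    failed = False
except MissingGtidError:
    failed = True
```

In this model the stale position is the only input that differs from a clean start, which is why the result depends purely on how many events the master has written since its reset.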

      
      --source include/master-slave.inc
      --source include/have_innodb.inc
      --source include/have_binlog_format_mixed.inc
      
      --echo ################
      --echo # Do it once...
      --echo ################
      
      --connection slave
      --source include/stop_slave.inc
      RESET SLAVE ALL;
      
      --connection master
      RESET MASTER;
      CREATE TABLE t1 (pk INT PRIMARY KEY);
      DROP TABLE t1;
      --save_master_pos
      
      --connection slave
      eval CHANGE MASTER TO master_host='127.0.0.1', master_port=$MASTER_MYPORT, master_user='root', master_gtid_pos=auto;
      --source include/start_slave.inc
      --sync_with_master
      
      --echo ################
      --echo # Do it twice...
      --echo ################
      
      --source include/stop_slave.inc
      RESET SLAVE ALL;
      
      --connection master
      RESET MASTER;
      CREATE TABLE t1 (pk INT PRIMARY KEY);
      INSERT INTO t1 VALUES (1);
      INSERT INTO t1 VALUES (2);
      --save_master_pos
      
      --connection slave
      eval CHANGE MASTER TO master_host='127.0.0.1', master_port=$MASTER_MYPORT, master_user='root', master_gtid_pos=auto;
      --source include/start_slave.inc
      --sync_with_master
      
      revision-id: knielsen@knielsen-hq.org-20130322102628-hxohewmbfyd1wig6
      revno: 3538
      branch-nick: 10.0-mdev26
      

              Activity

              knielsen Kristian Nielsen added a comment -

              To reset the GTID state, one currently must do CHANGE MASTER TO MASTER_GTID_POS=''

              I agree that it is natural to think that RESET SLAVE [ALL] would reset the
              GTID state. But that would not be correct. Think about multi-source, with
              several slave connections configured. The GTID state is shared for all
              configured slaves, so if RESET SLAVE master_connection_1 ALL were to nuke the
              GTID state, then it would corrupt any running master_connection_2 slave
              connection.
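A toy model may make the shared-state argument concrete. This is only a sketch under stated assumptions: `ToySlaveServer` and its methods are hypothetical names, and keying the position by domain id borrows from the domain discussion later in this thread rather than from the server's actual data structures.

```python
class ToySlaveServer:
    """Toy model: one GTID position shared by every configured connection."""

    def __init__(self):
        self.gtid_pos = {}      # domain id -> last applied GTID (server-wide)
        self.connections = {}   # connection name -> connection settings

    def change_master(self, name, host):
        self.connections[name] = {"host": host}

    def apply_event(self, gtid):
        domain = int(gtid.split("-")[0])
        self.gtid_pos[domain] = gtid

    def reset_slave_all(self, name, clear_gtid_state=False):
        """Drop one connection; optionally (hypothetically) clear the shared state."""
        del self.connections[name]
        if clear_gtid_state:
            self.gtid_pos.clear()


server = ToySlaveServer()
server.change_master("master_connection_1", "host1")
server.change_master("master_connection_2", "host2")
server.apply_event("1-10-100")  # replicated through master_connection_1
server.apply_event("2-11-200")  # replicated through master_connection_2

# Behaviour described above: the shared position survives the reset,
# so master_connection_2 is unaffected.
server.reset_slave_all("master_connection_1")
pos_after_reset = dict(server.gtid_pos)

# Hypothetical alternative: clearing the state would also discard the
# position that master_connection_2 still depends on.
server.change_master("master_connection_1", "host1")
server.reset_slave_all("master_connection_1", clear_gtid_state=True)
pos_after_nuke = dict(server.gtid_pos)
```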

              Maybe we could implement RESET ALL SLAVES that would also reset the GTID
              position. Or maybe the GTID should not be set by CHANGE MASTER, since it is
              global?

              In any case, I very much welcome suggestions for improving this. What do you
              think the correct behaviour should be? I remember being hit by this myself,
              but not immediately being able to decide how to make it better ...

              elenst Elena Stepanova added a comment -

              I'm already thinking about it, as I found out that in the case of multi-source replication it didn't work as I expected either (or maybe the multi-source variant just hasn't been implemented yet?)
              In any case, I need to experiment more with single- and multi-source before I can come up with suggestions.

              elenst Elena Stepanova added a comment -

              I'm changing it to a task for now since it's clearly not a straightforward bug but a separate topic that requires discussion and consideration. I think it should be convenient enough to have it inside a JIRA issue since it's public and is easier to watch than e-mails.

              elenst Elena Stepanova added a comment -

              So, I experimented a bit, trying to abstract myself from implementation details and imagine possible user expectations. And, while I understand the technical challenge and reasoning, something feels inconsistent about CHANGE MASTER / RESET SLAVE behavior in regard to GTID. Here is some initial contemplation on the subject, no good suggestions yet.

              Part 1
              ------

              I am User1, whose setup never involves multi-source; it's just plain master=>slave. Of course, you never know, maybe I'll need to switch them one day and want to be prepared for it, or I just like new cool things – anyway, I decide to use GTID.

              I start a fresh new pair and configure the slave
              CHANGE MASTER TO master_host= ..., ..., master_gtid_pos='';
              (or master_gtid_pos=auto, it shouldn't matter at this point, right?)

              For me master_gtid_pos is a parameter which defines the replication position – same way master_log_pos and master_log_file did before, so it's quite natural to have it in CHANGE MASTER (actually I don't know why I should provide it – I don't have to set default values of master_log_file/pos, but maybe it's because I need to indicate I want to use GTID now).

              I start replication, it goes on for a while, then something bad happens. I still have all master binlogs, so it's not a big deal, I can start over; I've been there.

              RESET SLAVE is supposed to do just that, it is defined as
              "makes the slave forget its replication position in the master's binary log. This statement is meant to be used for a clean start",
              and it used to do just that; but it doesn't anymore.

              The problem is, I might not even notice that RESET SLAVE didn't work, so I won't start looking for alternative solutions.
              Let's say I had on master

              create table t1 (i int);
              insert into t1 values (1);

              Slave synchronized with master, so it also has t1.
              Then I've done something bad, and decided to start replication over.
              I stopped slave, dropped t1, ran RESET SLAVE, and started it again, as I'd always done before.
              If the next statement on master does something with t1, my replication will abort (the table doesn't exist), so I will at least know about the problem.
              But if master continues with a different table
              create table t2 (i int)
              and keeps working with it, I might never know that I don't have t1 on slave anymore – until it's too late (master died, binlogs are gone, etc.)

              So, sooner or later when I detect the problem, I start googling and find out that instead of RESET SLAVE I apparently need to do
              CHANGE MASTER TO master_gtid_pos = '';

              That's weird, RESET SLAVE [ALL] is perceived as a reverse command for CHANGE MASTER; so if it's okay to set master_gtid_pos via CHANGE MASTER (and we already decided that it is), then it should be reverted by RESET SLAVE...
              It's not obvious, but we used to change master_log_pos/master_log_file to non-default values via CHANGE MASTER, so it's at least comprehensible (except that it doesn't work, see MDEV-4312; but I presume it's a bug, and will be fixed).

              Now, back to our side for a minute: we have clearly changed semantics of RESET SLAVE: earlier it would make slave forget the position, now (with GTID) it doesn't. But what does it do, then? I mean, yes, it still resets master_log_pos and master_log_file, but what is the meaning of it, apart from desynchronizing remaining GTID position and master log position?

              Part 2
              ------

              It's basically the same as the story for User1, only here I didn't do anything bad, I just at some point decided to move my master server to another host. Slave is fully synchronized, backups are in place, so I just stop replication, shut down master, move the data files (but not binlogs, I don't need them) to the new host, start master – effectively, it's the same as RESET MASTER. I do RESET SLAVE ALL since I need to forget both the position and connection parameters, set up replication again, start slave...
              The rest is the same as in Part 1 – depending on my luck with the correlation between the stored (and not reset) GTID position and current position on the new master, the replication will either abort (good case), or will continue from some random place, having skipped some events.

              Part 3
              ------

              As a User3, I want to create a multi-source setup.
              I configured m1 as
              CHANGE MASTER 'm1' master_host=.., ..., master_gtid_pos=''
              and started the slave 'm1'; it has been working for a while now.
              Now I want to add another master. I do exactly the same: I run
              CHANGE MASTER 'm2' ... master_gtid_pos='';
              but the slave refuses to run it saying that my slave is running. I find it weird and inconvenient – since when do I need to stop one slave to configure another one? – but it is what it is; so, I stop slave 'm1', re-run CHANGE MASTER 'm2' ... master_gtid_pos='' , start both slaves...

              and a disaster happens, my data gets all messed up, because m1 has started from the beginning.
              I didn't plan on that – how could I? I know that CHANGE MASTER 'm2' sets parameters for slave m2, it has never touched anything global.

              One has to learn the hard way, so I fix the data, restart m1, configure m2 with master_gtid_pos=auto (it should work, right?).
              Both m1 and m2 work for a while.

              Then I become User1 or User2 in regard to one of my slaves. Let's say I want to make m2 start from the beginning. How do I do that?
              First, I go through all the same confusion regarding RESET SLAVE 'm2' – and as you said, thank goodness it doesn't reset GTID position, otherwise it would have been another disaster.

              But even after I figured out RESET SLAVE is not my command anymore, how do I actually do it?
              I can't set an explicit master_gtid_pos (to the position of m1) because it requires stopping m1, and even if it didn't, it would have been extremely dangerous since while I'm doing it the m1's position could have changed. So, the only way is to actually stop m1, then configure m2, and then start both. It's already sad, but will be even sadder if I have 10 sources, or 20...

              So, unlike for User1, for User3 it doesn't look natural at all to set master_gtid_pos via CHANGE MASTER command, since it's not a local slave parameter. Instead, User3 needs a way to tell one slave to start from a particular position without affecting other slaves... But it doesn't seem possible with GTID, does it?

              knielsen Kristian Nielsen added a comment -

              > So, I experimented a bit, trying to abstract myself from implementation
              > details and imagine possible user expectations.

              Excellent analysis! It helped me a lot to get a better overview of where we
              are.

              > I start a fresh new pair and configure the slave
              > CHANGE MASTER TO master_host= ..., ..., master_gtid_pos='';
              > (or master_gtid_pos=auto, it shouldn't matter at this point, right?)

              Right.

              > For me master_gtid_pos is a parameter which defines the replication position
              > – same way master_log_pos and master_log_file did before, so it's quite
              > natural to have it in CHANGE MASTER (actually I don't know why I should
              > provide it – I don't have to set default values of master_log_file/pos, but
              > maybe it's because I need to indicate I want to use GTID now).

              Yes, it is to indicate using GTID.

              Actually, you do have to set default values of master_log_file/pos in normal
              replication, it is a mis-feature that one can omit it. Because if master has
              purged any binlogs, you get to start from whatever random position is the
              first non-purged file - which will certainly and silently corrupt your
              replication.
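The purged-binlog hazard can be sketched with a small model. This is an illustration only: `first_events_seen` and the file names `master-bin.000001` / `master-bin.000002` are hypothetical, and the model ignores positions within a file.

```python
def first_events_seen(binlog_files, purged, start_file=None):
    """Return the events a toy slave receives when starting replication.

    binlog_files maps file name -> list of events, in creation order.
    purged is the set of file names already removed on the master.
    With no start_file the slave begins at the first file still present.
    """
    available = [name for name in binlog_files if name not in purged]
    if start_file is None:
        start_file = available[0]
    elif start_file not in available:
        raise LookupError(f"{start_file} has been purged")
    start = available.index(start_file)
    return [ev for name in available[start:] for ev in binlog_files[name]]


binlogs = {
    "master-bin.000001": ["CREATE TABLE t1 (i INT)", "INSERT INTO t1 VALUES (1)"],
    "master-bin.000002": ["CREATE TABLE t2 (i INT)"],
}

# Nothing purged: an omitted position replays the full history.
full = first_events_seen(binlogs, purged=set())

# First file purged: the slave silently starts at t2 and never sees t1.
partial = first_events_seen(binlogs, purged={"master-bin.000001"})
```

Asking explicitly for a purged file raises an error in this model, which is the difference between a visible failure and the silent gap produced by omitting the position.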

              It is quite deep in the design that GTID state is a global property of the
              server, not a per-slave-connection position. This is needed for example for
              multi-source. It is possible with MASTER_GTID_POS=AUTO to switch eg. from
              having two masters to having a single master that itself replicates from the
              original two masters. Do you think it will be possible to explain this to
              users, or is it hopelessly complicated and will need to be re-designed
              completely?

              Now with your analysis, I am thinking that I did this incorrectly with CHANGE
              MASTER and GTID. Maybe it should instead be like this:

              • A new command CHANGE GTID TO "0-1-2". This requires all slaves to be
                stopped. It replaces CHANGE MASTER TO MASTER_GTID_POS="0-1-2".
              • A new command SHOW GTID STATUS, replaces the Gtid_Pos field in SHOW ALL
                SLAVES STATUS.
              • In CHANGE MASTER, one must now do MASTER_USE_GTID=1. This gives an error if
                no GTID position is set (either manually with CHANGE GTID, or downloaded
                automatically by connecting slave to master with old non-GTID position).

              This makes it clear that GTID state is global on the server, separate from any
              slave connection configuration. And clear that the individual slave connection
              can be using GTID to connect (MASTER_USE_GTID=1) or old style position
              (MASTER_USE_GTID=0).
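The proposed split between server-wide GTID state and a per-connection flag could be modeled roughly as follows. This is a sketch of the proposal as worded above, not of any implemented behaviour: `ToyServer` and its method names are hypothetical, the automatic download of a position over an old-style connection is not modeled, and treating "never set" (`None`) differently from an explicitly empty position (`''`) is an assumption.

```python
class ToyServer:
    """Toy model of the proposed split: global GTID state, per-connection flag."""

    def __init__(self):
        self.gtid_state = None   # None = never set; '' = explicitly empty
        self.connections = {}    # name -> {"use_gtid": bool, "running": bool}

    def change_gtid(self, pos):
        """Models the proposed CHANGE GTID TO ...: needs every slave stopped."""
        if any(c["running"] for c in self.connections.values()):
            raise RuntimeError("all slaves must be stopped to change the GTID state")
        self.gtid_state = pos

    def change_master(self, name, use_gtid):
        """Models CHANGE MASTER ... MASTER_USE_GTID=0|1 for one connection."""
        if use_gtid and self.gtid_state is None:
            raise RuntimeError("no GTID position is set")
        self.connections[name] = {"use_gtid": use_gtid, "running": False}

    def start_slave(self, name):
        self.connections[name]["running"] = True

    def show_gtid_status(self):
        """Models the proposed SHOW GTID STATUS: one server-wide value."""
        return self.gtid_state


server = ToyServer()

# Requesting GTID mode before any position exists is rejected.
try:
    server.change_master("m1", use_gtid=True)
    rejected_without_state = False
except RuntimeError:
    rejected_without_state = True

server.change_gtid("0-1-2")
server.change_master("m1", use_gtid=True)
server.change_master("m2", use_gtid=False)  # old-style position
server.start_slave("m1")

# With m1 running, the global state can no longer be changed.
try:
    server.change_gtid("")
    rejected_while_running = False
except RuntimeError:
    rejected_while_running = True
```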

              What do you think? I now understand that this is how I meant things to work,
              though I never formulated it explicitly like this before.

              > RESET SLAVE is supposed to do just that, it is defined as
              > "makes the slave forget its replication position in the master's binary log. This statement is meant to be used for a clean start",
              > and it used to do just that; but it doesn't anymore.

              I just read the documentation, indeed that is what it says. But it's rubbish,
              isn't it? Except for toy setups where one keeps all binlogs on the master
              forever, it doesn't work. Or am I missing something?

              But there is clearly a bug here! RESET SLAVE should remove Using_Gtid, it does
              not, shame on me. I've fixed and pushed.

              Now, if user does RESET SLAVE and then START SLAVE, things will
              "work". Replication will start from the first binlog file on the master,
              without using Gtid.

              > I stopped slave, dropped t1, ran RESET SLAVE, and started it again, as I'd always done before.
              > If the next statement on master does something with t1, my replication will abort (the table doesn't exist), so I will at least know about the problem.

              Right, this was a bug, fixed now.

              Now, replication will start without using GTID, from the first binlog file on
              the master. If some binlogs were purged, the same silent corruption may occur.
              If all binlogs were kept on the master, things will be ok, but it will no
              longer be using GTID.

              I won't say this is good behaviour, but it at least seems consistent with how
              it worked before. Or what do you think?

              > But if master continues with a different table
              > create table t2 (i int)
              > and keeps working with it, I might never know that I don't have t1 on slave anymore – until it's too late (master died, binlogs are gone, etc.)

              Yeah. I would prefer giving an error in case no position specified, but that
              is probably out due to backwards compatibility?

              At least, if we can educate the user that GTID state is set separately with
              CHANGE GTID, it should be clearer that CHANGE MASTER MASTER_USE_GTID=1 starts
              from whatever SHOW GTID STATUS displays, and that RESET MASTER reverts to
              MASTER_USE_GTID=0.

              > CHANGE MASTER TO master_gtid_pos = '';
              >
              > That's weird, RESET SLAVE [ALL] is perceived as a reverse command for CHANGE MASTER.

              Yes, it is weird. Just as Gtid_Pos in SHOW ALL SLAVES STATUS is weird, because
              it is not per-slave; it is global.

              Let me hear your opinion on CHANGE GTID / SHOW GTID STATUS / MASTER_USE_GTID,
              and if we agree then I will change implementation to that.

              > Now, back to our side for a minute: we have clearly changed semantics of
              > RESET SLAVE: earlier it would make slave forget the position, now (with
              > GTID) it doesn't. But what does it do, then?

              With the above bug fixed, now it sets also Using_Gtid=0.

              > It's basically the same as the story for User1, only here I didn't do
              > anything bad, I just at some point decided to move my master server to
              > another host. Slave is fully synchronized, backups are in place, so I just
              > stop replication, shut down master, move the data files (but not binlogs, I
              > don't need them) to the new host, start master – effectively, it's the same
              > as RESET MASTER. I do RESET SLAVE ALL since I need to forget both the
              > position and connection parameters, set up replication again, start slave...

              With the above bug fixed, things should work, but you will no longer be using
              GTID.

              If you add MASTER_GTID_POS=AUTO to the CHANGE MASTER command, you should get
              an error that master is missing the GTID requested by the slave. But user
              needs to be aware that RESET MASTER (or your above equivalent) is dangerous
              with GTID. Because it starts GTID generation from scratch, so now you have
              duplicate GTIDs in your system, unless you carefully remove the old ones
              everywhere. At least you get an error message in most cases rather than silent
              corruption.

              Once you see the error and issue CHANGE MASTER TO MASTER_GTID_POS='' (or
              CHANGE GTID TO ''), things should work again.

              The "recommended" way to do the above would be to copy the binlog files along
              also (maybe purge all logs but the latest first). Then there would be no need
              for RESET SLAVE, just CHANGE MASTER TO the new host and port, and GTID would
              connect automatically at the correct position (that's the whole point of GTID,
              to find position automatically on new master, right?). Of course this is
              untested, but it should work, I will add a test case for this.

              Does that sound ok? Any suggestions for improvement?

              > As a User3, I want to create a multi-source setup.
              > I configured m1 as
              > CHANGE MASTER 'm1' master_host=.., ..., master_gtid_pos=''
              > started the slave 'm1', it has been working for a while for now.
              > Now I want to add another master. I do exactly the same: I run
              > CHANGE MASTER 'm2' ... master_gtid_pos='';

              You do not need to specify master_gtid_pos='' in the second CHANGE
              MASTER. This will be clearer with the change to CHANGE GTID:

              CHANGE GTID TO '';
              CHANGE MASTER 'm1' ... master_use_gtid=1;
              CHANGE MASTER 'm2' ... master_use_gtid=1;

              > One has to learn the hard way, so I fix the data, restart m1, configure m2 with master_gtid_pos=auto (it should work, right?).

              Yes.

              > Then I become User1 or User2 in regard to one of my slaves. Lets say I want to make m2 start from the beginning. How do I do that?

              First, to use multi-source with GTID, you have to setup the two different
              masters with different domain ids. Let's say gtid_domain_id=1 for m1 and
              gtid_domain_id=2 for m2.

              Then you need to get the current GTID state, using SHOW ALL SLAVES STATUS
              (SHOW GTID STATUS). Let's say it is "1-10-100,2-11-200".

              Now you want to start from the beginning of domain 2 (the domain of m2). So
              you need to remove that domain from the state:

              CHANGE MASTER TO MASTER_GTID_POS="1-10-100"

              (or CHANGE GTID TO "1-10-100").
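Removing one domain from a GTID state string is mechanical enough to script. A minimal sketch, assuming the state is a comma-separated list of domain-server-sequence triples as in the example above; `drop_domain` is a hypothetical helper, not a server function.

```python
def drop_domain(gtid_pos, domain_id):
    """Return gtid_pos without the entry for domain_id.

    gtid_pos is a comma-separated list of domain-server-sequence triples.
    """
    kept = [
        gtid for gtid in gtid_pos.split(",")
        if gtid and int(gtid.split("-")[0]) != domain_id
    ]
    return ",".join(kept)


# Forget domain 2 (the domain of m2) and keep domain 1 as it is.
new_pos = drop_domain("1-10-100,2-11-200", 2)
statement = f'CHANGE MASTER TO MASTER_GTID_POS="{new_pos}"'
```

As pointed out earlier in this thread, the state should be read while the slaves are stopped, since the position of m1 could otherwise move between reading the state and setting it.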

              Alternatively, you can start m2 slave from the start of the m2 binlogs,
              without using GTID:

              CHANGE MASTER 'm2' TO master_log_file='', master_log_pos=0;

              Then it will download the correct gtid position and update it
              automatically. Then the next time you change master for m2 you can use
              MASTER_GTID_POS=AUTO again. It would be nice if I could implement that one
              could ask to connect the first time with old-style position, but then the next
              time with GTID.

              > It's already sad, but will be even sadder if I have 10 sources, or 20...

              Yes, perhaps a bit sad. I did at one point consider that MASTER_GTID_POS would
              only change the domains mentioned, and leave all other domains intact. And one
              would need to set seq_no to zero to remove a domain
              (MASTER_GTID_POS="1-10-100,2-11-0"). But I thought that was too magic, and
              users could always specify the full GTID state if they wanted to keep some domains.
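For comparison, the rejected "only change the domains mentioned" semantics might look like this. It is a sketch of the idea as described in the paragraph above, not of anything implemented; `apply_partial_gtid_pos` is a hypothetical name.

```python
def apply_partial_gtid_pos(current, update):
    """Merge update into current, touching only the domains update mentions.

    A sequence number of 0 removes that domain from the state.
    """
    def parse(pos):
        state = {}
        for gtid in filter(None, pos.split(",")):
            domain, server_id, seq_no = (int(part) for part in gtid.split("-"))
            state[domain] = (server_id, seq_no)
        return state

    state = parse(current)
    for domain, (server_id, seq_no) in parse(update).items():
        if seq_no == 0:
            state.pop(domain, None)
        else:
            state[domain] = (server_id, seq_no)
    return ",".join(f"{d}-{s}-{n}" for d, (s, n) in sorted(state.items()))


current = "1-10-100,2-11-200"

# Mentioning only domain 2 with seq_no 0 removes it and leaves domain 1 alone.
removed = apply_partial_gtid_pos(current, "2-11-0")

# Mentioning only domain 1 moves it and leaves domain 2 alone.
moved = apply_partial_gtid_pos(current, "1-10-150")
```

The special meaning of sequence number 0 is the "too magic" part: the same statement either sets or deletes a domain depending on one field.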

              Hm, a lot longer reply than I intended. But hopefully we are getting closer to
              something that is at least workable, if not as perfect as I had hoped
              initially ...

But user needs to be aware that RESET MASTER (or your above equivalent) is dangerous with GTID. Because it starts GTID generation from scratch, so now you have duplicate GTIDs in your system, unless you carefully remove the old ones everywhere. At least you get an error message in most cases rather than silent corruption. Once you see the error and issue CHANGE MASTER TO MASTER_GTID_POS='' (or CHANGE GTID TO ''), things should work again. The "recommended" way to do the above would be to copy the binlog files along also (maybe purge all logs but the latest first). Then there would be no need for RESET SLAVE, just CHANGE MASTER TO the new host and port, and GTID would connect automatically at the correct position (that's the whole point of GTID, to find position automatically on new master, right?). Of course this is untested, but it should work, I will add a test case for this. Does that sound ok? Any suggestions for improvement? > As a User3, I want to create a multi-source setup. > I configured m1 as > CHANGE MASTER 'm1' master_host=.., ..., master_gtid_pos='' > started the slave 'm1', it has been working for a while for now. > Now I want to add another master. I do exactly the same: I run > CHANGE MASTER 'm2' ... master_gtid_pos=''; You do not need to specify master_gtid_pos='' in the second CHANGE MASTER. This will be clearer with the change to CHANGE GTID: CHANGE GTID TO ''; CHANGE MASTER 'm1' ... master_using_gtid=1; CHANGE MASTER 'm2' ... master_using_gtid=1; > One has to learn the hard way, so I fix the data, restart m1, configure m2 with master_gtid_pos=auto (it should work, right?). Yes. > Then I become User1 or User2 in regard to one of my slaves. Lets say I want to make m2 start from the beginning. How do I do that? First, to use multi-source with GTID, you have to setup the two different masters with different domain ids. Let's say gtid_domain_id=1 for m1 and gtid_domain_id=2 for m2. 
Then you need to get the current GTID state, using SHOW ALL SLAVES STATUS (SHOW GTID STATUS). Let's say it is "1-10-100,2-11-200". Now you want to start from the beginning of domain 2 (the domain of m2). So you need to remove that domain from the state: CHANGE MASTER TO MASTER_GTID_POS="1-10-100" (or CHANGE GTID TO "1-10-100"). Alternatively, you can start m2 slave from the start of the m2 binlogs, without using GTID: CHANGE MASTER 'm2' TO master_log_file='', master_log_pos=0; Then it will download the correct gtid position and update it automatically. Then the next time you change master for m2 you can use MASTER_GTID_POS=AUTO again. It would be nice if I could implement that one could ask to connect the first time with old-style position, but then the next time with GTID. > It's already sad, but will be even sadder if I have 10 sources, or 20... Yes, perhaps a bit sad. I did at one point consider that MASTER_GTID_POS would only change the domains mentioned, and leave all other domains intact. And one would need to set seq_no to zero to remove a domain (MASTER_GTID_POS="1-10-100,2-11-0"). But I thought that was too magic, and users could always specify the full GTID state if they wanted to keep some domains. Hm, a lot longer reply than I indended. But hopefully we are getting closer to something that is at least workable, if not as perfect as I had hoped initially ...
              Hide
              knielsen Kristian Nielsen added a comment -

              > Of course this is untested, but it should work, I will add a test case for
              > this.

              And of course this did not work. I'm fixing right now.

              - Kristian.
              knielsen Kristian Nielsen added a comment -

              > > Of course this is untested, but it should work, I will add a test case for
              > > this.

              > And of course this did not work. I'm fixing right now.

              I pushed a fix for this. Test case at the end of rpl_gtid_startpos.test.

              elenst Elena Stepanova added a comment -

              >> Do you think it will be possible to explain this to
              >> users, or is it hopelessly complicated and will need to be re-designed
              >> completely?

              I have no doubt that it will be possible to explain everything to users who are planning to run complicated configurations or workflows (switching servers on a regular basis, etc.). I'm more concerned about the part of the user base that runs simple, straightforward replication, where the most they might do is promote the slave to a new master after a crash. I expect them to be the majority, and I want to be sure that we don't make their lives harder, and even more so that we don't put them in a situation where they are likely to make a critical mistake just because they do things the way they used to, while we changed the way things work. I expect this category of users won't read deep into the GTID documentation, precisely because they don't need the complicated setup; they are likely to follow instructions along the lines of 'First steps' or 'Quick setup'. If we eventually manage to explain the important things in a few words, we should be fine. I know we are not quite there yet, because even though I'm trying to understand things, I keep making mistakes that could have been fatal in a production environment.

              >> Now with your analysis, I am thinking that I did this incorrectly with CHANGE
              >> MASTER and GTID. Maybe it should instead be like this:
              >>
              >> - A new command CHANGE GTID TO "0-1-2". This requires all slaves to be
              >> stopped. It replaces CHANGE MASTER TO MASTER_GTID_POS="0-1-2".
              >>
              >> - A new command SHOW GTID STATUS, replaces the Gtid_Pos field in SHOW ALL
              >> SLAVES STATUS.

              Do we really need the new syntax? I'd think that if the GTID position is a global value, we could just make it a global dynamic variable. Then SHOW GTID STATUS would not be needed either, since it would only return a single value; we can just as well do SHOW VARIABLES or SELECT @@gtid_position (or whatever it's called). Is there any reason why it wouldn't work?
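
              For illustration, a minimal sketch of what the global-variable approach can look like. It assumes the gtid_slave_pos system variable found in MariaDB 10.0 and later; that name does not come from this discussion, and the position value is just the example used above.

              ```sql
              -- Inspect the server-wide GTID state (not tied to any single slave connection).
              SELECT @@GLOBAL.gtid_slave_pos;

              -- The position can only be changed while all slave threads are stopped.
              STOP ALL SLAVES;
              SET GLOBAL gtid_slave_pos = '0-1-2';
              ```

              Because the state is an ordinary global variable, it also shows up in SHOW GLOBAL VARIABLES LIKE 'gtid%', so no dedicated SHOW command is needed to read it.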

              >> replaces the Gtid_Pos field in SHOW ALL
              >> SLAVES STATUS.

              I think that showing the value in SHOW ALL SLAVES STATUS doesn't hurt, and may even be beneficial from the usability perspective, so if it comes at a low price, it could stay there as well.

              >> This makes it clear that GTID state is global on the server, separate from any
              >> slave connection configuration. And clear that the individual slave connection
              >> can be using GTID to connect (MASTER_USE_GTID=1) or old style position
              >> (MASTER_USE_GTID=0).
              >>
              >> What do you think? I now understand that this is how I meant things to work,
              >> though I never formulated it explicitly like this before.

              Yes, if my current understanding of how things are meant to work is anywhere close to the truth, the proposed changes sound quite logical.

              >> > RESET SLAVE is supposed to do just that, it is defined as
              >> > "makes the slave forget its replication position in the master's binary log. This statement is meant to be used for a clean start",
              >> > and it used to do just that; but it doesn't anymore.

              >> I just read the documentation, indeed that is what it says. But it's rubbish,
              >> isn't it?

              Possibly, but that's how things used to work, and I'm pretty sure a number of people used it in their own, however tricky, ways: either in "toy" (in fact, just low-traffic) setups, or in conjunction with RESET MASTER (on master), etc. It wouldn't be very kind to make radical changes in the way things are supposed to work, especially because it's not easy to explain at a high level why the algorithm has to be different with and without GTID.

              >> I won't say this is good behaviour, but it at least seems consistent with how
              >> it worked before. Or what do you think?

              Yes, I think it's better to keep it consistent with the old behavior for the time being.
              I might get back to you regarding this after I have tried it (I didn't check the new version yet).

              >> I would prefer giving an error in case no position specified, but that
              >> is probably out due to backwards compatibility?

              Personally, I don't see a big tragedy in doing RESET SLAVE without providing a master position afterwards. I mean, it seems natural to consider it a shortcut for master_log_pos/master_log_file=<start from the beginning of whatever we have>. The absence of massive complaints about the slave starting, after RESET, from a non-zero position due to previously purged master binlogs indirectly confirms that people don't have a problem with this. So I'd rather keep it as is.

              >> At least, if we can educate the user that GTID state is set separately with
              >> CHANGE GTID, it should be clearer that CHANGE MASTER MASTER_USE_GTID=1 starts
              >> from whatever SHOW GTID STATUS displays, and that RESET MASTER reverts to
              >> MASTER_USE_GTID=0.

              Right. Although, in the same way as in the previous note about the old-style master position, I wouldn't find it wrong if we considered the "empty" GTID value the default and used it if nothing else was previously set; but if you prefer insisting on always setting it, either manually or through automatic discovery, I don't have strong objections to that either.

              >> Just as Gtid_Pos in SHOW ALL SLAVES STATUS is weird, because
              >> it is not per-slave it is global.

              Hm.. Actually, I don't see anything weird in showing GTID position in SHOW ALL SLAVES STATUS (as opposed to SHOW SLAVE STATUS), exactly because it's global for all slaves.
              (Of course, it becomes somewhat strange that SHOW ALL SLAVES STATUS is not the same as SHOW SLAVE STATUS when we only have one slave, but that's another story).

              >> > decided to move my master server to
              >> > another host. Slave is fully synchronized, backups are in place, so I just
              >> > stop replication, shut down master, move the data files (but not binlogs, I
              >> > don't need them) to the new host, start master – effectively, it's the same
              >> > as RESET MASTER. I do RESET SLAVE ALL since I need to forget both the
              >> > position and connection parameters, set up replication again, start slave...

              >> user
              >> needs to be aware that RESET MASTER (or your above equivalent) is dangerous
              >> with GTID. Because it starts GTID generation from scratch, so now you have
              >> duplicate GTIDs in your system, unless you carefully remove the old ones
              >> everywhere. At least you get an error message in most cases rather than silent
              >> corruption.

              >> Once you see the error and issue CHANGE MASTER TO MASTER_GTID_POS='' (or
              >> CHANGE GTID TO ''), things should work again.

              >> The "recommended" way to do the above would be to copy the binlog files along
              >> also (maybe purge all logs but the latest first). Then there would be no need
              >> for RESET SLAVE, just CHANGE MASTER TO the new host and port, and GTID would
              >> connect automatically at the correct position

              That's exactly the case where I'm concerned about owners of simple setups, and how things become somewhat more complicated for them, or at least different.
              I don't know what real users do in a situation like that, but if I were one of them, I would do exactly as I described, because it's simpler. This way, I don't need to purge anything on the old master, I don't need to move an extra log (which might be quite big), and I don't need to remember exactly which parameters I must modify in CHANGE MASTER (what if I forget to change the host? After RESET SLAVE ALL I'll get a clear error; without RESET SLAVE ALL the slave will attempt to connect to the old master, and lucky me if nothing else is running on that host/port at the moment; etc.).

              >> > As a User3, I want to create a multi-source setup.
              >> > I configured m1 as
              >> > CHANGE MASTER 'm1' master_host=.., ..., master_gtid_pos=''
              >> > started the slave 'm1', it has been working for a while for now.
              >> > Now I want to add another master. I do exactly the same: I run
              >> > CHANGE MASTER 'm2' ... master_gtid_pos='';

              >> You do not need to specify master_gtid_pos='' in the second CHANGE
              >> MASTER. This will be clearer with the change to CHANGE GTID:

              >> CHANGE GTID TO '';
              >> CHANGE MASTER 'm1' ... master_using_gtid=1;
              >> CHANGE MASTER 'm2' ... master_using_gtid=1;

              Yes, it's much clearer this way. My point was, I'd expect slaves to be symmetrical, while it was very much not so before.
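
              A sketch of the symmetric multi-source setup, written with the syntax that MariaDB 10.0 and later accept (SET GLOBAL gtid_slave_pos and master_use_gtid=slave_pos) rather than the CHANGE GTID / master_using_gtid spelling proposed in the quote. The host names and the replication user are hypothetical, and each master is assumed to be configured with its own gtid_domain_id.

              ```sql
              -- One global GTID state, set once for the whole server.
              STOP ALL SLAVES;
              SET GLOBAL gtid_slave_pos = '';

              -- Both connections are configured identically; neither carries its own position.
              CHANGE MASTER 'm1' TO master_host = 'm1.example.com', master_user = 'repl',
                                    master_use_gtid = slave_pos;
              CHANGE MASTER 'm2' TO master_host = 'm2.example.com', master_user = 'repl',
                                    master_use_gtid = slave_pos;

              START ALL SLAVES;
              ```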

              >> It would be nice if I could implement that one
              >> could ask to connect the first time with old-style position, but then the next
              >> time with GTID.

              Is it difficult to implement? Frankly, I thought that auto meant pretty much that... Even more so if we have CHANGE MASTER .. master_using_gtid=1|0, where 1 throws an error when the GTID position is not set; then it would be logical to also have master_using_gtid=auto (or SET GLOBAL gtid = 'auto', whichever is more reasonable from an implementation perspective), which would mean that the slave connects with an old-style position, acquires the GTID position, sets it, and connects using it from then on.

              >> hopefully we are getting closer to
              >> something that is at least workable, if not as perfect as I had hoped
              >> initially ...

              You never know, maybe it will turn out "perfect enough" in the end. Although, of course, nothing is ever as perfect as we initially hope.

              knielsen Kristian Nielsen added a comment -

              I believe all of these issues should be resolved, as well as possible
              at least, with the new interface pushed recently
              (master_use_gtid=slave_pos|current_pos).
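
              A hedged sketch of how that interface is typically used on a single-master slave. The comments reflect my reading of the MariaDB documentation for master_use_gtid, not anything stated in this issue.

              ```sql
              STOP SLAVE;

              -- Connect from the last GTID this server applied as a slave (gtid_slave_pos).
              CHANGE MASTER TO master_use_gtid = slave_pos;

              -- Alternative: also take into account GTIDs this server wrote to its own binlog
              -- (gtid_current_pos).
              -- CHANGE MASTER TO master_use_gtid = current_pos;

              -- Alternative: go back to old-style binlog file/offset positions.
              -- CHANGE MASTER TO master_use_gtid = no;

              START SLAVE;
              ```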


                People

                • Assignee:
                  knielsen Kristian Nielsen
                  Reporter:
                  elenst Elena Stepanova