
MDEV-4820: GTID strict mode is full of bugs and doesn't serve its purpose

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 10.0.3
    • Fix Version/s: 10.0.5
    • Component/s: None
    • Labels: None

      Description

      Let's consider the following testing setup (I use code at revision 3773 of 10.0 branch).
      Start 2 MariaDB servers with "--gtid-strict-mode --replicate-wild-ignore-table=mysql.gtid_slave_pos". Set up server 2 as a slave of server 1. Execute on server 1:
      create database test;
      use test;
      create table t (n int) engine innodb;
      insert into t values (1);
      After that @@global.gtid_current_pos is '0-1-3' on both servers.
      Now imagine a production situation: server 2 goes down, server 1 continues to be a master and executes transactions, then at some point it is taken down for a cold backup and restored on a new machine without binlogs, but with @@global.gtid_slave_pos set to the value that @@global.gtid_current_pos had at the moment the server went down. After that the server continues to be a master.
      Let's emulate this situation: stop slave on server 2, bring down server 1, delete all master-bin.* files, bring up server 1, set @@global.gtid_slave_pos = '0-1-5', start slave on server 2 (sketched below). And what do you know: server 2 doesn't report any errors. And if I execute new transactions on server 1 they are happily replicated. So server 2 skipped transactions and nobody noticed. That's not how strict mode should work.
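      In SQL terms the emulation is roughly this (the binlog removal is a filesystem step on the host of server 1):
      -- on server 2:
      stop slave;
      -- shut down server 1, delete its master-bin.* files, start it again, then on server 1:
      set global gtid_slave_pos = '0-1-5';
      -- on server 2:
      start slave;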

      Let's continue the experiment. Let's say we stopped at the GTID '0-1-6'. Now
      "stop slave" on server 2
      "reset slave all" on server 2
      shutdown server 2
      delete all master-bin.* files
      bring up server 2
      "set @@global.gtid_slave_pos = '0-2-10'"
      "change master to" on server 1 to make server 2 master
      "start slave" on server 1
      try to execute transactions on server 2
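      In SQL terms, roughly (the master host name is illustrative):
      -- on server 2:
      stop slave;
      reset slave all;
      -- shut down server 2, delete its master-bin.* files, start it again, then on server 2:
      set global gtid_slave_pos = '0-2-10';
      -- on server 1:
      change master to master_host='server2', master_use_gtid=slave_pos;
      start slave;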
      For some reason at this point server 1 doesn't report any errors and doesn't replicate anything from server 2. Oops. If, after advancing gtid_current_pos on server 2, we "stop slave" and then "start slave" on server 1, we start seeing the error "connecting slave requested to start from GTID 0-1-6, which is not in the master's binlog". That is the expected behavior. Why couldn't it show this error at the very beginning, before server 2 had any events in its binlog?

      Now moving further. Let's say we restored replication and stopped at gtid_current_pos = '0-2-11'. Now
      "stop slave" on server 1
      execute transaction on server 2
      execute transaction on server 1
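      For example (the exact statements are illustrative; only the resulting GTIDs matter):
      -- on server 1:
      stop slave;
      -- on server 2:
      insert into test.t values (2);  -- becomes GTID 0-2-12
      -- on server 1:
      insert into test.t values (3);  -- becomes GTID 0-1-12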
      At this point server 1 has gtid_current_pos = '0-1-12' and server 2 has gtid_current_pos = '0-2-12', i.e. they have alternate futures. Now if we make server 1 the master and connect server 2 to it, server 2 shows the error "connecting slave requested to start from GTID 0-2-12, which is not in the master's binlog". This is not helpful. An alternate future is a very serious problem and there should be an easy and clear way to detect this situation. The error message is the obvious choice for detection tools, but MariaDB doesn't distinguish it from the "slave starts from a GTID from before the binlogs were started" situation. Can this be changed? I've already requested this behavior in MDEV-4478, but apparently it was either forgotten or for some reason you decided not to implement it. If the latter, I'd like to hear why.


            Activity

            pivanof Pavel Ivanov added a comment -

            I'm attaching a patch with my approach to resolving this bug. It looks like it covers all possible use cases in GTID strict mode. I couldn't figure out what would be the intended behavior in such use cases for the server in non-strict mode, so I didn't change that. Also I didn't check if START SLAVE UNTIL still works properly in all cases with GTID strict mode.

            knielsen Kristian Nielsen added a comment -

            I cannot repeat the first part. This is using 10.0-base revision
            revid:igor@askmonty.org-20130806203318-esxb7kpq9kab0i97

            Here is my test case:

            --let $rpl_topology=1->2
            --source include/rpl_init.inc
            
            --connection server_2
            --source include/stop_slave.inc
            SET GLOBAL gtid_strict_mode= 1;
            CHANGE MASTER TO master_use_gtid=slave_pos;
            --source include/start_slave.inc
            
            --connection server_1
            SET GLOBAL gtid_strict_mode= 1;
            CREATE TABLE t1 (a INT PRIMARY KEY);
            INSERT INTO t1 VALUES (1);
            --save_master_pos
            
            --connection server_2
            --sync_with_master
            SELECT * FROM t1 ORDER BY a;
            
            --source include/stop_slave.inc
            
            --connection server_1
            INSERT INTO t1 VALUES (2);
            INSERT INTO t1 VALUES (3);
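            # Emulate losing the master's binlogs while preserving the old GTID position.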
            SET @old_gtid_pos= @@GLOBAL.gtid_current_pos;
            RESET MASTER;
            SET GLOBAL gtid_slave_pos= @old_gtid_pos;
            
            --connection server_2
            --source include/start_slave.inc
            
            --connection server_1
            INSERT INTO t1 VALUES (4);
            --save_master_pos
            
            --connection server_2
            --sync_with_master
            SELECT * FROM t1 ORDER BY a;
            
            # Clean up.
            --connection server_1
            DROP TABLE t1;
            
            --source include/rpl_end.inc
            

            As expected, the slave fails to connect with the error: "[ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'The binlog on the master is missing the GTID 0-1-2 requested by the slave (even though both a prior and a subsequent sequence number does exist), and GTID strict mode is enabled', Internal MariaDB error code: 1236"

            This is as expected. If the requested position is missing in the binlogs on
            the master, it must match exactly with @@GLOBAL.gtid_slave_pos.

            My guess is you are looking at old code. The most recent code for GTID is in
            10.0-base; merges to 10.0 happen only irregularly.

            knielsen Kristian Nielsen added a comment -

            Hm, I cannot reproduce either on rev 3773 of branch 10.0; the slave gets an error message on connect.

            Can you please elaborate how to reproduce this / how the situation you describe differs from my test case?

            knielsen Kristian Nielsen added a comment -

            For the second part of the problem: We cannot give an error when server 1
            connects to server 2. By deleting the binlogs on server 2, it has effectively
            become a fresh server, and it is perfectly valid to start replicating from it
            (e.g. using a different domain_id). But we should give an error when we
            receive the first, incorrect event in domain 0 (the code currently does
            not); I will fix that.

            For the third part: If I understand correctly, you want the server to give
            different error messages for these two cases:

            • Slave requests to start at some point G from master. Master does not have
              G, but it is itself a slave of an upstream master, and will receive G
              shortly.
            • Slave requested to start at a GTID that does not exist on the master, and
              never will (what you refer to as "alternate future").

            In most cases we can determine which one it is simply by looking at the
            sequence numbers; that is probably a good idea. I will try to come up with
            something (but I don't consider the wording of error messages urgent, so not
            immediately).
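
            Roughly, the distinction based on sequence numbers would be (the positions
            are only examples):

            -- The slave requests to start from GTID 0-1-6 in domain 0.
            -- On the master, the current binlog position in domain 0 is, say:
            SELECT @@GLOBAL.gtid_binlog_pos;   -- '0-2-10'
            -- seq_no 10 >= 6, yet 0-1-6 is not in the binlog: the histories have
            -- diverged (the "alternate future" case).
            -- If the master's seq_no were instead below 6, the master might simply not
            -- yet have received the transaction from its own upstream master.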

            However, note that both of these cases are distinct from "slave requests to
            start from a GTID that has been purged". That case already has a separate
            error message. But by deleting binlogs, the information needed to
            distinguish it is lost.

            pivanof Pavel Ivanov added a comment -

            Note that the patch I've attached has a test case that should reproduce the problems.

            Regarding your code: I'm not so sure that RESET MASTER is equivalent to stopping the server, deleting the binlogs and starting it again. I'd think that some in-memory structures are not cleared.

            Regarding the second part: note that the test case doesn't say anything about a different domain_id – it's about a different server_id. Also note that server 1 doesn't replicate at all when first connecting to server 2. And in strict mode server 2 can send an error to server 1 right away, because it doesn't have the GTID that server 1 uses to connect.

            Regarding the third part: you understood me incorrectly. By "alternate future" I don't mean the situation where the slave requested a GTID that doesn't exist on the master – that's too generic. "Alternate future" is when the master has a GTID with the same domain_id and seq_no but a different server_id. In strict mode with a correct failover process this situation should never happen, so it must be detected to tell whether failover has gone wrong somewhere.
            When a GTID doesn't exist on the master it can be, as you say, that the master will receive this GTID shortly (although I don't know how the server can detect that, and this situation should never happen in strict mode with a correct failover process). It can also be that the slave's GTID is too old and the master no longer has the corresponding binlog.

            knielsen Kristian Nielsen added a comment -

            Fix pushed to 10.0-base:

            Revision: revid:knielsen@knielsen-hq.org-20130816131025-etjrvmfvupsjzq83

            MDEV-4820: Empty master does not give error for slave GTID position that does not exist in the binlog

            The main bug here was the following situation:

            Suppose we set up a completely new master2 as an extra multi-master for an
            existing slave that already has a different master1 for domain_id=0. When the
            slave tries to connect to master2, master2 will not have anything the slave
            requests in domain_id=0, but that is fine, as master2 is presumably meant to
            serve e.g. domain_id=1. (This is MDEV-4485.)

            But suppose that master2 then actually starts sending events from
            domain_id=0. In this case, the fix for MDEV-4485 was incomplete, and the code
            would fail to give the error that the position requested by the slave in
            domain_id=0 was missing from the binlogs of master2. This could lead to lost
            events or completely wrong replication.

            The patch for this bug fixes this issue.

            In addition, it cleans up the code a bit, getting rid of the fake_gtid_hash in
            the code. And the error message when slave and master have diverged due to
            alternate future is clarified, as requested in the bug description.
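
            For illustration, the setup in question looks roughly like this (connection
            names and hosts are made up):

            -- On the slave, which already replicates domain_id=0 from master1:
            CHANGE MASTER 'master1' TO master_host='host1', master_use_gtid=slave_pos;
            -- Add a completely new, empty master2, meant to serve e.g. domain_id=1:
            CHANGE MASTER 'master2' TO master_host='host2', master_use_gtid=slave_pos;
            START ALL SLAVES;
            -- The bug: if master2 nevertheless started sending events in domain_id=0,
            -- the error about the missing domain 0 position was not given before this fix.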


              People

              • Assignee: knielsen Kristian Nielsen
              • Reporter: pivanof Pavel Ivanov
              • Votes: 0
              • Watchers: 3
