Uploaded image for project: 'MariaDB Server'
  1. MDEV-6589

Incorrect relay log start position when restarting SQL thread after error in parallel replication

    Details

    • Sprint:

      Description

      Incorrect relay log start position when restarting SQL thread after error in parallel replication

      Suppose we are using parallel replication in GTID mode, using multiple
      replication domains to run independent things in parallel.

      Now suppose we get an error in one domain that causes replication to fail and
      stop. Suppose further that the domain that fails is somewhat behind another
      domain in the relay log. Eg. we might have events D1 D2 E1, event D1 in domain
      D fails when E1 in domain E has already been replicated and committed.

      In this case, the replication SQL thread will stop with current GTID position
      "D1,E1". However, the IO thread keeps running.

      Now suppose the SQL thread is restarted without first stopping the IO
      thread. In this case, the SQL thread needs to continue from where it came to
      in the relay log (in contrast, if the IO thread was also stopped first, then
      the relay log will be deleted, and everything fetched anew from the master
      starting at GTID position "D1,E1").

      In this situation, it is necessary for the restarted SQL thread to start at
      the correct position in the relay log corresponding to the GTID position
      "D1,E1". Thus, the domain D must start at an earlier point than the domain
      E. The SQL driver thread must start fetching at the start of D1, but then skip
      E1 as it has already been executed.

      Unfortunately, currently there is no such code to handle this situation
      correctly. So what happens is that the SQL will restart from after the latest
      event that was replicated correctly, losing any transactions in different
      replication domains that occur earlier in the relay log and were not yet
      replicated due to parallel replication. In the example, restart will begin after
      E1, losing events D1 and D2.

      So this is a rather serious problem.

      A work-around is to stop the IO thread before restarting the SQL thread after
      an error. This will cause the relay logs to be purged and fetched anew from
      the master - and this code path does correctly handle starting each
      replication domain in the correct position. This workaround can be done
      manually by the user, or we could implement it as a temporary work-around
      until the proper fix can be made.

      To fix this, I think we need to implement proper GTID position search in the
      relay log. Thus, when the SQL thread starts in GTID mode, it needs to find the
      start position in the relay log based on the current GTID position, same way
      as happens on the master when the IO thread connects over the network to
      request events sent from a particular GTID position.

      [This will then also prepare the way for later implementing that we can
      preserve the relay log on the slave when the slave server is restarted (so we
      do not need to re-fetch already fetched but not executed events). This however
      will in addition require that crash recovery is implemented for the relay log]

      Here is a test case for the bug:

      --source include/have_innodb.inc
      --let $rpl_topology=1->2
      --source include/rpl_init.inc
      
      --connection server_2
      --source include/stop_slave.inc
      CHANGE MASTER TO master_use_gtid=slave_pos;
      --source include/start_slave.inc
      
      
      --echo *** MDEV-6551: Some replication errors are ignored if slave_parallel_threads > 0 ***
      
      --connection server_1
      ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
      CREATE TABLE t1 (a int PRIMARY KEY) ENGINE=InnoDB;
      INSERT INTO t1 VALUES (1);
      --source include/save_master_gtid.inc
      
      --connection server_2
      --source include/sync_with_master_gtid.inc
      SET @old_parallel_threads=@@GLOBAL.slave_parallel_threads;
      --source include/stop_slave.inc
      SET GLOBAL slave_parallel_threads=10;
      
      --connection server_1
      SET @old_dom= @@gtid_domain_id;
      SET gtid_domain_id= 1;
      INSERT INTO t1 VALUES (2);
      INSERT INTO t1 VALUES (3);
      SET gtid_domain_id= 2;
      INSERT INTO t1 VALUES (4);
      INSERT INTO t1 VALUES (5);
      SET gtid_domain_id= 1;
      INSERT INTO t1 VALUES (6);
      SET gtid_domain_id= @old_dom;
      
      --source include/save_master_gtid.inc
      
      # Set a local row conflict to cause domain_id=1 to have to wait.
      --connect (s2,127.0.0.1,root,,test,$SERVER_MYPORT_2,)
      --connection s2
      SET sql_log_bin= 0;
      BEGIN;
      INSERT INTO t1 VALUES (2);
      
      --connection server_2
      --source include/start_slave.inc
      
      # Wait for domain_id=2 to have progressed further into the relay log than
      # the domain_id=1.
      --let $wait_condition= SELECT COUNT(*) = 1 FROM t1 WHERE a=5
      --source include/wait_condition.inc
      
      # Now force a duplicate key error for domain_id=1.
      --connection s2
      COMMIT;
      SET sql_log_bin= 1;
      
      --connection server_2
      --let $slave_sql_errno= 1062
      --source include/wait_for_slave_sql_error.inc
      
      # Now resolve the error, and restart the SQL thread.
      # The bug was that it would start at the point up to which domain_id=2 was
      # replicated, skipping the earlier events from domain_id=1 that were not yet
      # replicated due to the duplicate key error.
      SET sql_log_bin= 0;
      DELETE FROM t1 WHERE a=2;
      SET sql_log_bin= 1;
      
      --source include/start_slave.inc
      --source include/sync_with_master_gtid.inc
      
      # Check that everything was replicated, without losing any rows.
      SELECT * FROM t1 ORDER BY a;
      
      
      # Clean up.
      --connection server_2
      --source include/stop_slave.inc
      SET GLOBAL slave_parallel_threads=@old_parallel_threads;
      --source include/start_slave.inc
      
      --connection server_1
      DROP TABLE t1;
      
      --source include/rpl_end.inc
      

        Attachments

          Issue links

            Activity

              People

              • Assignee:
                knielsen Kristian Nielsen
                Reporter:
                knielsen Kristian Nielsen
              • Votes:
                0 Vote for this issue
                Watchers:
                3 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: