  MariaDB Server / MDEV-6608

Slave loses master binlog filename when master crashes in the middle of writing an event group

    Details

    • Type: Bug
    • Status: Stalled
    • Priority: Critical
    • Resolution: Unresolved
    • Affects Version/s: 10.0.13
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      If the master crashes in the middle of writing an event group to the binlog,
      the slave can receive this partial event group once the master is restarted.

      The slave code is written to handle this by recognising the master restart on
      the format description event logged at the restart (it has a special flag only
      set for the first format description logged after restart). The slave will
      roll back the partial transaction and drop all temporary tables.
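
      As a rough illustration only (this is not the actual MariaDB code; Event,
      Slave and master_restart are names invented for this sketch), the
      slave-side handling amounts to something like:

      #include <iostream>
      #include <string>
      #include <vector>

      struct Event {
        enum Type { FORMAT_DESCRIPTION, QUERY, XID } type;
        bool master_restart;   // "first format description after restart" flag
      };

      struct Slave {
        bool in_event_group = false;           // saw the start of a group, no commit yet
        std::vector<std::string> temp_tables;  // temp tables kept open for the master

        void apply(const Event &ev) {
          if (ev.type == Event::FORMAT_DESCRIPTION && ev.master_restart) {
            // The master restarted: a partially received event group can never
            // be completed, so roll it back and drop the temporary tables.
            if (in_event_group) {
              std::cout << "ROLLBACK partial event group\n";
              in_event_group = false;
            }
            temp_tables.clear();
          }
          // ... normal event application continues here ...
        }
      };

      int main() {
        Slave s;
        s.in_event_group = true;   // pretend the master crashed mid-group
        s.apply(Event{Event::FORMAT_DESCRIPTION, /*master_restart=*/true});
      }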

      But there is a bug in the update of the master binlog position. The update of
      the current master filename normally happens during processing of Rotate
      events. However, due to the crash, no Rotate event occurs at the end of
      master-bin.000001. And the fake Rotate event sent from the master at reconnect
      after restart is not processed, because Rotate events in the middle of an
      event group are not processed.

      The result is that the slave ends up in a state where it has processed events
      in master-bin.000002, but the filename part of the replication position
      (e.g. SHOW SLAVE STATUS) is still master-bin.000001.
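
      A minimal sketch of that failure mode (SlaveState and handle_rotate are
      invented names, not the real relay log code):

      #include <iostream>
      #include <string>

      struct SlaveState {
        std::string master_file = "master-bin.000001";  // shown in SHOW SLAVE STATUS
        bool in_event_group = false;
      };

      // Simplified rotate handling: a Rotate event arriving in the middle of an
      // event group is ignored, so the current master filename is not updated.
      void handle_rotate(SlaveState &st, const std::string &new_file) {
        if (st.in_event_group)
          return;                       // the fake Rotate after reconnect hits this
        st.master_file = new_file;
      }

      int main() {
        SlaveState st;
        st.in_event_group = true;                 // master crashed mid-group
        handle_rotate(st, "master-bin.000002");   // fake Rotate sent at reconnect
        // The slave goes on to execute events from master-bin.000002, but:
        std::cout << st.master_file << "\n";      // still prints master-bin.000001
      }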

      I checked MySQL; it seems they avoid this problem during binlog recovery at
      master restart after a crash. If they find a partial event group at the end of
      the last master binlog, they truncate the file to just before that event.
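
      For comparison, a rough sketch of that recovery strategy (not the actual
      MySQL code; read_next_event is a hypothetical placeholder for the real
      event parser, and POSIX truncate() stands in for the server's own file
      routines):

      #include <cstdio>
      #include <string>
      #include <unistd.h>   // truncate()

      struct Event { bool ok; bool ends_group; long end_pos; };

      // Hypothetical placeholder: the real code parses the event header and
      // treats a short or garbled read as end of usable data (ok == false).
      static Event read_next_event(std::FILE *) { return {false, false, 0}; }

      static void recover_last_binlog(const std::string &path, long header_end) {
        std::FILE *f = std::fopen(path.c_str(), "rb");
        if (!f) return;
        long last_complete_group_end = header_end;   // nothing valid before the header
        for (Event ev = read_next_event(f); ev.ok; ev = read_next_event(f)) {
          if (ev.ends_group)                         // e.g. an XID / COMMIT event
            last_complete_group_end = ev.end_pos;
        }
        std::fclose(f);
        // Anything after the last complete group is the partial group left by
        // the crash; this style of recovery simply cuts it off.
        (void)truncate(path.c_str(), last_complete_group_end);
      }

      int main() { recover_last_binlog("master-bin.000001", /*header_end=*/4); }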

      Here is an MTR test case. The test case fails because --sync_with_master does
      not complete, due to the incorrect binlog position. This test case uses
      parallel replication, but I tested that the same bug is present for
      non-parallel replication.

      --source include/have_innodb.inc
      --source include/have_binlog_format_statement.inc
      --let $rpl_topology=1->2
      --source include/rpl_init.inc
      
      --connection server_2
      --source include/stop_slave.inc
      CHANGE MASTER TO master_use_gtid= current_pos;
      --source include/start_slave.inc
      
      --connection server_1
      CREATE TABLE t1 (a INT PRIMARY KEY, b VARCHAR(100) CHARACTER SET utf8);
      --save_master_pos
      
      --connection server_2
      --sync_with_master
      --source include/stop_slave.inc
      
      
      --connection server_1
      SET gtid_domain_id= 1;
      DELETE FROM t1;
      ALTER TABLE t1 ENGINE=InnoDB;
      CREATE TEMPORARY TABLE t2 (a INT PRIMARY KEY);
      INSERT INTO t2 VALUES (1);
      INSERT INTO t2 VALUES (2);
      
      
      --connection default
      SET gtid_domain_id= 2;
      CREATE TEMPORARY TABLE t3 (a INT PRIMARY KEY);
      INSERT INTO t3 VALUES (10);
      INSERT INTO t3 VALUES (20);
      
      
      --connection server_1
      INSERT INTO t1 SELECT a, 'server_1' FROM t2;
      
      
      --connection default
      INSERT INTO t1 SELECT a, 'default' FROM t3;
      
      
      --connection server_1
      INSERT INTO t1 SELECT a+2, '+server_1' FROM t2;
      
      
      # Crash the master server in the middle of writing an event group.
      --write_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
      wait
      EOF
      
      FLUSH TABLES;
      SET SESSION debug_dbug="+d,crash_before_writing_xid";
      --error 2006,2013
      INSERT INTO t1 SELECT a+4, '++server_1' FROM t2;
      
      --source include/wait_until_disconnected.inc
      --connection default
      --source include/wait_until_disconnected.inc
      
      --append_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
      restart
      EOF
      
      --connection default
      --enable_reconnect
      --source include/wait_until_connected_again.inc
      
      --connection server_1
      --enable_reconnect
      --source include/wait_until_connected_again.inc
      
      INSERT INTO t1 VALUES (0, 1);
      --save_master_pos
      SHOW BINLOG EVENTS IN 'master-bin.000001';
      SHOW BINLOG EVENTS IN 'master-bin.000002';
      
      --connection server_2
      # Start the slave replicating the events.
      # The main thing to test here is that the slave will know that it
      # needs to abort the partially received event group, so that the
      # execution of format_description event will not wait infinitely
      # for a commit of the incomplete group that never happens.
      
      --source include/start_slave.inc
      --sync_with_master
      
      SELECT * FROM t1 ORDER BY a;
      SHOW STATUS LIKE 'Slave_open_temp_tables';
      
      
      --connection server_1
      DROP TABLE t1;
      
      --source include/rpl_end.inc
      


            Activity

            knielsen Kristian Nielsen added a comment

            I made a patch that solves the problem like the MySQL solution, by truncating
            away the problematic part of the binlog in case of a partial event at the end
            of the binlog during crash recovery.

            But on second thought, I really do not like this approach; I think it is
            wrong.

            Generally, it is a very bad idea to destroy data during crash recovery. The
            system crashed; this already indicates some serious problem. Destroying data
            runs a high risk of making a serious situation even worse.

            In this particular case, some (possibly temporary) problem reading some event
            could cause crash recovery to remove an arbitrary part of the last binlog
            file, which could make later manual recovery significantly harder.

            And even if recovery would not be possible from a damaged-but-not-truncated
            binlog file, truncating it could still prevent a later investigation into the
            root cause of the crash.

            So I think I want to drop that patch and instead try an approach that leaves
            the crashed binlog file intact and ensures that the crashed binlog can be
            properly handled by the rest of the code.


              People

              • Assignee: knielsen Kristian Nielsen
              • Reporter: knielsen Kristian Nielsen
              • Votes: 0
              • Watchers: 2

              Dates

              • Created:
              • Updated: