Details
- Type: Bug
- Status: Stalled
- Priority: Critical
- Resolution: Unresolved
- Affects Version/s: 10.0.13
- Fix Version/s: None
- Component/s: None
- Labels:
Description
If the master crashes in the middle of writing an event group to the binlog,
the slave can receive this partial event group once the master is restarted.
The slave code is written to handle this by recognising the master restart from
the format description event logged at the restart (it has a special flag that
is only set for the first format description event logged after a restart). The
slave will then roll back the partial transaction and drop all temporary tables.
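As a rough illustration of this intended handling (only a sketch with made-up
names and types, not the actual server code):

#include <cstdio>

// Hypothetical stand-ins for the slave-side structures.
struct FormatDescriptionEvent {
  // Special flag: set only on the first format description event the
  // master logs after a restart.
  bool first_after_restart;
};

struct SlaveState {
  bool in_partial_event_group;   // an event group was only partially received
  int  open_temporary_tables;
};

// What the slave is expected to do when the restart-flagged event arrives.
void handle_format_description(const FormatDescriptionEvent &ev, SlaveState &s)
{
  if (!ev.first_after_restart)
    return;
  if (s.in_partial_event_group)
  {
    // The master crashed before completing this group; it can never be
    // committed, so roll it back.
    std::printf("rolling back partial event group from before the crash\n");
    s.in_partial_event_group= false;
  }
  // Temporary tables did not survive the master restart, so drop them all.
  std::printf("dropping %d slave temporary tables\n", s.open_temporary_tables);
  s.open_temporary_tables= 0;
}

int main()
{
  SlaveState s{true, 2};
  handle_format_description({true}, s);
  return 0;
}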
But there is a bug in the update of the master binlog position. The current
master filename is normally updated while processing Rotate events. However,
due to the crash, no Rotate event occurs at the end of master-bin.000001. And
the fake Rotate event that the master sends when the slave reconnects after the
restart is not processed either, because Rotate events arriving in the middle
of an event group are skipped.
The result is that the slave ends up in a state where it has processed events
in master-bin.000002, but the filename part of the replication position
(e.g. as shown by SHOW SLAVE STATUS) is still master-bin.000001.
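A minimal sketch of why the filename goes stale (hypothetical names, not the
real replication code; the real bookkeeping lives in the slave's relay log
info):

#include <cstdio>
#include <string>

// Hypothetical stand-in for the slave's notion of its current position.
struct SlavePosition {
  std::string master_log_name;        // the filename shown by SHOW SLAVE STATUS
  unsigned long long master_log_pos;
};

// Rotate handling: the filename is only updated when the Rotate is applied,
// and Rotates arriving in the middle of an event group are skipped.
void handle_rotate(const std::string &new_file, unsigned long long new_pos,
                   bool inside_event_group, SlavePosition &pos)
{
  if (inside_event_group)
    return;                           // the fake Rotate after reconnect hits this
  pos.master_log_name= new_file;
  pos.master_log_pos= new_pos;
}

int main()
{
  SlavePosition pos{"master-bin.000001", 1234};
  // The master crashed mid-group, so the slave still believes it is inside a
  // group when the fake Rotate for the new file arrives after reconnect:
  handle_rotate("master-bin.000002", 4, /*inside_event_group=*/true, pos);
  // Events from master-bin.000002 then get executed, but the reported
  // filename is still the old one:
  std::printf("position: %s:%llu\n", pos.master_log_name.c_str(),
              pos.master_log_pos);
  return 0;
}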
I checked MySQL; it seems they avoid this problem during binlog recovery when
the master restarts after a crash. If they find a partial event group at the
end of the last master binlog, they truncate the file to just before that
event group.
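A minimal sketch of that truncate-on-recovery idea (not the actual MySQL or
MariaDB recovery code; the event scanner is stubbed out):

#include <cstdio>
#include <string>
#include <sys/types.h>
#include <unistd.h>    // truncate() (POSIX)

// Hypothetical stand-in for the real event scanner: it would parse the events
// in the file and return the byte offset just after the last event group that
// was completely written (i.e. terminated by its COMMIT/Xid event).
static off_t last_complete_group_end(const std::string & /*binlog_file*/)
{
  return 4096;    // pretend the last complete group ends at this offset
}

// If the last binlog ends in an incomplete event group, cut the file back to
// the end of the last complete group during crash recovery.
static bool recover_last_binlog(const std::string &binlog_file, off_t file_size)
{
  off_t keep= last_complete_group_end(binlog_file);
  if (keep >= file_size)
    return true;    // the file ends on a group boundary, nothing to do
  std::printf("truncating %s from %lld to %lld bytes\n",
              binlog_file.c_str(), (long long)file_size, (long long)keep);
  return truncate(binlog_file.c_str(), keep) == 0;
}

int main()
{
  // Example: a binlog that grew to 5000 bytes but whose last group is partial.
  recover_last_binlog("master-bin.000001", 5000);
  return 0;
}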
Here is an MTR test case. The test case fails because --sync_with_master does
not complete, due to the incorrect binlog position. The test case uses parallel
replication, but I verified that the same bug is present with non-parallel
replication.
--source include/have_innodb.inc
--source include/have_binlog_format_statement.inc
--let $rpl_topology=1->2
--source include/rpl_init.inc

--connection server_2
--source include/stop_slave.inc
CHANGE MASTER TO master_use_gtid= current_pos;
--source include/start_slave.inc

--connection server_1
CREATE TABLE t1 (a INT PRIMARY KEY, b VARCHAR(100) CHARACTER SET utf8);
--save_master_pos

--connection server_2
--sync_with_master
--source include/stop_slave.inc

--connection server_1
SET gtid_domain_id= 1;
DELETE FROM t1;
ALTER TABLE t1 ENGINE=InnoDB;
CREATE TEMPORARY TABLE t2 (a INT PRIMARY KEY);
INSERT INTO t2 VALUES (1);
INSERT INTO t2 VALUES (2);

--connection default
SET gtid_domain_id= 2;
CREATE TEMPORARY TABLE t3 (a INT PRIMARY KEY);
INSERT INTO t3 VALUES (10);
INSERT INTO t3 VALUES (20);

--connection server_1
INSERT INTO t1 SELECT a, 'server_1' FROM t2;

--connection default
INSERT INTO t1 SELECT a, 'default' FROM t3;

--connection server_1
INSERT INTO t1 SELECT a+2, '+server_1' FROM t2;

# Crash the master server in the middle of writing an event group.
--write_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
wait
EOF
FLUSH TABLES;
SET SESSION debug_dbug="+d,crash_before_writing_xid";
--error 2006,2013
INSERT INTO t1 SELECT a+4, '++server_1' FROM t2;
--source include/wait_until_disconnected.inc

--connection default
--source include/wait_until_disconnected.inc

--append_file $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
restart
EOF

--connection default
--enable_reconnect
--source include/wait_until_connected_again.inc

--connection server_1
--enable_reconnect
--source include/wait_until_connected_again.inc
INSERT INTO t1 VALUES (0, 1);
--save_master_pos
SHOW BINLOG EVENTS IN 'master-bin.000001';
SHOW BINLOG EVENTS IN 'master-bin.000002';

--connection server_2
# Start the slave replicating the events.
# The main thing to test here is that the slave will know that it
# needs to abort the partially received event group, so that the
# execution of format_description event will not wait infinitely
# for a commit of the incomplete group that never happens.
--source include/start_slave.inc
--sync_with_master
SELECT * FROM t1 ORDER BY a;
SHOW STATUS LIKE 'Slave_open_temp_tables';

--connection server_1
DROP TABLE t1;
--source include/rpl_end.inc
Activity
I made a patch that solves the problem in the same way as MySQL: by truncating
away the problematic part of the binlog when a partial event group is found at
the end of the binlog during crash recovery.
But on second thought, I really do not like this approach; I think it is
wrong.
Generally, it is a very bad idea to destroy data during crash recovery. The
system crashed, which already indicates some serious problem. Destroying data
runs a high risk of making a serious situation even worse.
In this particular case, some (possibly temporary) problem reading an event
could cause crash recovery to remove an arbitrary part of the last binlog
file, which could make later manual recovery significantly harder.
And even if recovery from a damaged-but-not-truncated binlog file were not
possible, truncating it could still prevent a later investigation from
determining the root cause of the crash.
So I think I want to drop that patch and instead try an approach that leaves
the crashed binlog file intact, while ensuring that the crashed binlog can be
handled properly by the rest of the code.