
Incorrect relay log start position when restarting SQL thread after error in parallel replication

Description


Suppose we are using parallel replication in GTID mode, using multiple
replication domains to run independent things in parallel.

Now suppose we get an error in one domain that causes replication to fail and
stop. Suppose further that the failing domain is somewhat behind another
domain in the relay log. For example, we might have events D1, D2, E1; event
D1 in domain D fails when E1 in domain E has already been replicated and
committed.

In this case, the replication SQL thread will stop with current GTID position
"D1,E1". However, the IO thread keeps running.
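In GTID mode the replication position is thus not a single point in the log, but one GTID per replication domain, and each domain advances independently. A toy illustration of this per-domain position tracking (plain Python, purely illustrative; not how the server represents positions internally):

```python
# A GTID position maps each replication domain to the last applied
# sequence number in that domain; domains advance independently.
def advance(position, domain, seqno):
    # Applying a transaction moves only its own domain forward.
    position[domain] = max(position.get(domain, 0), seqno)

position = {}
# Apply D1, E1, D2 (possibly by different parallel worker threads).
for domain, seqno in [("D", 1), ("E", 1), ("D", 2)]:
    advance(position, domain, seqno)

print(position)  # {'D': 2, 'E': 1} -- one entry per domain
```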

Now suppose the SQL thread is restarted without first stopping the IO
thread. In this case, the SQL thread needs to continue from where it left off
in the relay log. (In contrast, if the IO thread was also stopped first, the
relay log would be deleted and everything fetched anew from the master,
starting at GTID position "D1,E1".)

In this situation, the restarted SQL thread must start at the position in the
relay log corresponding to the GTID position "D1,E1". Thus domain D must start
at an earlier point than domain E: the SQL driver thread must start fetching
at the start of D1, but then skip E1, as it has already been executed.

Unfortunately, no code currently handles this situation correctly. What
happens instead is that the SQL thread restarts from just after the latest
event that was replicated, losing any transactions in other replication
domains that occur earlier in the relay log and were not yet replicated due to
parallel replication. In the example, the restart will begin after E1, losing
events D1 and D2.
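The difference between the buggy restart point and the required one can be shown with a small sketch. This is plain Python modelling events as (domain, seqno) pairs, purely to illustrate the logic; real relay log events and position handling are of course quite different:

```python
# Relay log contents in the example: D1, D2 in domain D; E1 in domain E.
# The slave has applied E1 but not D1/D2 (D1 failed), so the per-domain
# applied state is: D -> 0 (nothing), E -> 1 (up to E1).
relay_log = [("D", 1), ("D", 2), ("E", 1)]
applied = {"D": 0, "E": 1}

# Buggy behaviour: restart just after the latest event in the relay log
# that was applied -- here E1, which is after everything.
last_applied_idx = max(i for i, (d, s) in enumerate(relay_log)
                       if s <= applied.get(d, 0))
buggy_restart = relay_log[last_applied_idx + 1:]

# Required behaviour: start at the first event not yet applied in its
# domain, and skip any later events that were already applied.
start_idx = next(i for i, (d, s) in enumerate(relay_log)
                 if s > applied.get(d, 0))
correct_restart = [(d, s) for (d, s) in relay_log[start_idx:]
                   if s > applied.get(d, 0)]

print(buggy_restart)    # [] -- D1 and D2 are silently lost
print(correct_restart)  # [('D', 1), ('D', 2)] -- E1 correctly skipped
```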

So this is a rather serious problem.

A work-around is to stop the IO thread before restarting the SQL thread after
an error. This causes the relay logs to be purged and fetched anew from the
master, and that code path does correctly start each replication domain at its
correct position. The work-around can be applied manually by the user, or we
could implement it in the server as a temporary measure until the proper fix
can be made.

To fix this, I think we need to implement a proper GTID position search in the
relay log. When the SQL thread starts in GTID mode, it needs to find the start
position in the relay log based on the current GTID position, in the same way
as the master does when an IO thread connects over the network and requests
events starting from a particular GTID position.
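The core of such a search can be sketched as follows. Again this is an illustrative Python model, not server code, with events reduced to (domain, seqno) pairs; the real search has to decode binary relay log events:

```python
def find_start_position(relay_log, position):
    """Return the index of the first relay log event that still needs to
    be applied for the given per-domain GTID position, or None if the
    whole log was already applied.

    Events are (domain, seqno); position maps domain -> last applied
    seqno. The SQL driver thread would begin reading at the returned
    event, and individually skip any later event whose seqno is already
    covered by the position (like E1 in the example)."""
    for idx, (domain, seqno) in enumerate(relay_log):
        if seqno > position.get(domain, 0):
            return idx
    return None

# Example from the description: domain D must start earlier than domain E.
relay_log = [("D", 1), ("D", 2), ("E", 1)]
start = find_start_position(relay_log, {"D": 0, "E": 1})
print(start)  # 0 -- start at D1; E1 is skipped during execution
```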

[This will then also prepare the way for later implementing preservation of
the relay log on the slave across server restarts (so we do not need to
re-fetch events that were already fetched but not yet executed). That will,
however, additionally require crash recovery for the relay log.]

Here is a test case for the bug:

--source include/have_innodb.inc
--let $rpl_topology=1->2
--source include/rpl_init.inc

--connection server_2
--source include/stop_slave.inc
CHANGE MASTER TO master_use_gtid=slave_pos;
--source include/start_slave.inc

--echo *** MDEV-6551: Some replication errors are ignored if slave_parallel_threads > 0 ***

--connection server_1
ALTER TABLE mysql.gtid_slave_pos ENGINE=InnoDB;
CREATE TABLE t1 (a int PRIMARY KEY) ENGINE=InnoDB;
INSERT INTO t1 VALUES (1);
--source include/save_master_gtid.inc

--connection server_2
--source include/sync_with_master_gtid.inc
SET @old_parallel_threads=@@GLOBAL.slave_parallel_threads;
--source include/stop_slave.inc
SET GLOBAL slave_parallel_threads=10;

--connection server_1
SET @old_dom= @@gtid_domain_id;
SET gtid_domain_id= 1;
INSERT INTO t1 VALUES (2);
INSERT INTO t1 VALUES (3);
SET gtid_domain_id= 2;
INSERT INTO t1 VALUES (4);
INSERT INTO t1 VALUES (5);
SET gtid_domain_id= 1;
INSERT INTO t1 VALUES (6);
SET gtid_domain_id= @old_dom;
--source include/save_master_gtid.inc

# Set a local row conflict to cause domain_id=1 to have to wait.
--connect (s2,127.0.0.1,root,,test,$SERVER_MYPORT_2,)
--connection s2
SET sql_log_bin= 0;
BEGIN;
INSERT INTO t1 VALUES (2);

--connection server_2
--source include/start_slave.inc

# Wait for domain_id=2 to have progressed further into the relay log than
# the domain_id=1.
--let $wait_condition= SELECT COUNT(*) = 1 FROM t1 WHERE a=5
--source include/wait_condition.inc

# Now force a duplicate key error for domain_id=1.
--connection s2
COMMIT;
SET sql_log_bin= 1;

--connection server_2
--let $slave_sql_errno= 1062
--source include/wait_for_slave_sql_error.inc

# Now resolve the error, and restart the SQL thread.
# The bug was that it would start at the point up to which domain_id=2 was
# replicated, skipping the earlier events from domain_id=1 that were not yet
# replicated due to the duplicate key error.
SET sql_log_bin= 0;
DELETE FROM t1 WHERE a=2;
SET sql_log_bin= 1;
--source include/start_slave.inc
--source include/sync_with_master_gtid.inc

# Check that everything was replicated, without losing any rows.
SELECT * FROM t1 ORDER BY a;

# Clean up.
--connection server_2
--source include/stop_slave.inc
SET GLOBAL slave_parallel_threads=@old_parallel_threads;
--source include/start_slave.inc
--connection server_1
DROP TABLE t1;
--source include/rpl_end.inc

Environment

None

Status

Assignee

Kristian Nielsen

Reporter

Kristian Nielsen

Components

Fix versions

Affects versions

10.0.13

Priority

Critical