Incorrect relay log start position when restarting SQL thread after error in parallel replication
Suppose we are using parallel replication in GTID mode, using multiple
replication domains to run independent things in parallel.
Now suppose we get an error in one domain that causes replication to fail and
stop. Suppose further that the domain that fails is somewhat behind another
domain in the relay log. Eg. we might have events D1 D2 E1, event D1 in domain
D fails when E1 in domain E has already been replicated and committed.
In this case, the replication SQL thread will stop with current GTID position
"D1,E1". However, the IO thread keeps running.
Now suppose the SQL thread is restarted without first stopping the IO
thread. In this case, the SQL thread needs to continue from where it came to
in the relay log (in contrast, if the IO thread was also stopped first, then
the relay log will be deleted, and everything fetched anew from the master
starting at GTID position "D1,E1").
In this situation, it is necessary for the restarted SQL thread to start at
the correct position in the relay log corresponding to the GTID position
"D1,E1". Thus, the domain D must start at an earlier point than the domain
E. The SQL driver thread must start fetching at the start of D1, but then skip
E1 as it has already been executed.
Unfortunately, currently there is no such code to handle this situation
correctly. So what happens is that the SQL will restart from after the latest
event that was replicated correctly, losing any transactions in different
replication domains that occur earlier in the relay log and were not yet
replicated due to parallel replication. In the example, restart will begin after
E1, losing events D1 and D2.
So this is a rather serious problem.
A work-around is to stop the IO thread before restarting the SQL thread after
an error. This will cause the relay logs to be purged and fetched anew from
the master - and this code path does correctly handle starting each
replication domain in the correct position. This workaround can be done
manually by the user, or we could implement it as a temporary work-around
until the proper fix can be made.
To fix this, I think we need to implement proper GTID position search in the
relay log. Thus, when the SQL thread starts in GTID mode, it needs to find the
start position in the relay log based on the current GTID position, same way
as happens on the master when the IO thread connects over the network to
request events sent from a particular GTID position.
[This will then also prepare the way for later implementing that we can
preserve the relay log on the slave when the slave server is restarted (so we
do not need to re-fetch already fetched but not executed events). This however
will in addition require that crash recovery is implemented for the relay log]
Here is a test case for the bug: