If the same GTID is received on multiple master connections in multi-source replication, the event is executed twice, causing corruption or replication failure

Description

With multi-source replication, a slave can receive the same GTID twice if
there are multiple paths through which events can arrive from a master. An
example is a three-node multi-master setup where each server replicates as a
slave from the two others.

With correctly configured GTID, it is easy to detect that an event has already
been received once; just compare the sequence number against the previous GTID
received within that domain, and ignore the event if already applied. But
currently, the code does not handle this correctly; instead it applies the
event twice, or in GTID strict mode stops with an error.

We cannot really fix this as default behaviour, as this could break upgrades
of existing setups, and it also conflicts with the strict mode behaviour of
giving an error, which is desired by some users. However, we can implement a
--gtid-ignore-duplicates option, which will enable this behaviour.

So instead of the gtid strict mode behaviour, which fails if we see D-S1-M
after D-S2-N (M<=N), we will handle it as follows:

If we receive T1=D-S1-M, and current position in D is D-S2-N with N >= M, we
drop T1.

If M > N, then the event needs to be applied, however we need to protect
against two different master connections trying to apply the same GTID at the
same time. So the first connection to see the new GTID with M > N is set as
the current owner of the domain, and starts applying the transaction. When it
is done and has committed, the domain is released and the owner is
cleared. Any second connection that received a GTID with M > N while the
domain is already reserved by another owner will need to wait until the
current owner is done, and then make the decision to either discard, or apply
(then becoming the new owner).
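
The decision described above can be sketched roughly as follows. This is only
an illustration in Python, not the server code; the names Gtid, DomainState,
decide and committed are hypothetical, and a master connection is represented
by a plain string.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Gtid:
    # A GTID written D-S-M: domain id, server id, sequence number.
    domain_id: int
    server_id: int
    seq_no: int


@dataclass
class DomainState:
    # Hypothetical per-domain state kept on the slave.
    current_seq_no: int = 0          # N: highest sequence number applied
    owner: Optional[str] = None      # master connection applying right now


def decide(gtid: Gtid, state: DomainState, connection: str) -> str:
    """Return 'drop', 'wait' or 'apply' for an incoming GTID."""
    if gtid.seq_no <= state.current_seq_no:
        # N >= M: already applied through another path, so discard it.
        return "drop"
    if state.owner is not None and state.owner != connection:
        # M > N, but another connection holds the domain: wait until it
        # is done, then call decide() again.
        return "wait"
    # M > N and the domain is free (or already ours): take ownership.
    state.owner = connection
    return "apply"


def committed(gtid: Gtid, state: DomainState) -> None:
    """Record the commit and release the domain."""
    state.current_seq_no = gtid.seq_no
    state.owner = None
```

A connection that got 'wait' repeats the decision once the owner has
committed; at that point the same GTID normally yields 'drop', while a later
one yields 'apply'.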

A good way to implement it seems to be to have a current owner of each domain,
in the form of the Relay_log_info. In parallel replication, we also have a
reference count of worker threads active in that domain. When a worker thread
gets the lock on the domain, it sets the owner (if unset), and increases the
reference count. When it is done, it decreases the count, and if it reaches
zero it removes the owner and signals any other waiters that the domain is now
free to be grabbed by someone else.
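
A minimal sketch of such a reference-counted owner, assuming one condition
variable per domain. The class DomainOwnership and its methods are
hypothetical names used for illustration; the rli argument stands in for the
Relay_log_info of a master connection, and the real implementation may be
structured differently.

```python
import threading


class DomainOwnership:
    """Owner plus reference count for one replication domain (sketch)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._owner = None      # stand-in for the owning Relay_log_info
        self._refcount = 0      # worker threads active in this domain

    @property
    def owner(self):
        with self._cond:
            return self._owner

    @property
    def refcount(self):
        with self._cond:
            return self._refcount

    def acquire(self, rli):
        """Block until the domain is free or already owned by rli."""
        with self._cond:
            while self._owner is not None and self._owner != rli:
                self._cond.wait()
            self._owner = rli           # set the owner if unset
            self._refcount += 1

    def release(self, rli):
        """Drop one reference; the last one clears the owner."""
        with self._cond:
            if self._owner != rli or self._refcount == 0:
                raise RuntimeError("release() by a non-owner")
            self._refcount -= 1
            if self._refcount == 0:
                self._owner = None
                # Wake all waiters; each re-checks the owner in acquire().
                self._cond.notify_all()
```

After acquire() returns, the caller would still need to re-check whether the
GTID has meanwhile been applied by the previous owner, and discard it if so.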

Normally, if a slave asks to connect at GTID position D-S2-N, but the master is
only at D-S1-M, M < N, then the slave will get an error that it is ahead of
the master. However, with --gtid-ignore-duplicates this is a normal
situation (e.g. the GTID came directly from A->C, it has not yet come A->B,
and now C wants to connect as a slave to B). So in this case the slave needs
to tell the master not to give an error; instead the master must simply
wait for the GTID D-S2-N to turn up and then start sending events to the
slave.
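
The master-side choice at connect time could look roughly like the sketch
below. How the slave actually signals the master is not specified here, so
the ignore_duplicates flag and the function on_slave_connect are purely
hypothetical, and only the sequence numbers within one domain are compared.

```python
def on_slave_connect(slave_seq_no: int, master_seq_no: int,
                     ignore_duplicates: bool) -> str:
    """Return what the master does for one domain at connect time.

    slave_seq_no  -- N, the position the slave asks to start from
    master_seq_no -- M, the latest sequence number the master has
    ignore_duplicates -- hypothetical flag sent by the slave
    """
    if slave_seq_no <= master_seq_no:
        # The master already has the requested position.
        return "send"
    if ignore_duplicates:
        # M < N is expected: wait for D-S2-N to turn up, then send.
        return "wait"
    # Default behaviour: the slave is ahead of the master.
    return "error"
```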

Environment

None

Status

Assignee

Kristian Nielsen

Reporter

Kristian Nielsen

Labels

None

Fix versions

Affects versions

Priority

Major