Performance problem in parallel replication with multi-level slaves

Description

In MariaDB 10.0, the primary way to get parallelism on the slave is to apply
in parallel the batches of transactions that were group committed together on
the master. Thus, to get good parallelism, it is necessary to have many
transactions in each group commit.

However, when using a multi-level replication hierarchy, like M->S1->S2, the
group commits on the slave S1 are done independently; they do not necessarily
match the group commits on M. Thus, the groups on S1 can easily be smaller
than the groups on M, which reduces parallelism on S2 compared to S1, possibly
leaving S2 unable to keep up.

However, on S1, we often in fact know that some transactions T1, T2, T3 were
group committed together on the master, and thus very likely could group
commit together on the slave also (we know this as long as the I/O thread has
had time to fetch all transactions from the master). Thus, we could have some
heuristics to wait more aggressively for all of T1, T2, T3 to queue up for
group commit before committing T1. This would preserve more of the group
commit batches on the original master M, making S2 able to have parallelism
similar to S1.
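
A minimal sketch of this heuristic, in Python rather than server code. It
assumes each transaction in the relay log carries an identifier of the master
group commit it belonged to; the field name master_commit_id and the helpers
Txn and should_keep_waiting are hypothetical, chosen for illustration, and are
not MariaDB internals. The group leader keeps waiting only while another
member of its master group is already in the relay log but has not yet queued
for commit.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    name: str
    master_commit_id: int  # hypothetical: id of the group commit on the master


def should_keep_waiting(leader, queued, relay_log):
    """Return True while some member of the leader's master group has been
    fetched into the relay log but has not yet queued for group commit."""
    expected = {t.name for t in relay_log
                if t.master_commit_id == leader.master_commit_id}
    ready = {t.name for t in queued
             if t.master_commit_id == leader.master_commit_id}
    ready.add(leader.name)
    return bool(expected - ready)


# T1, T2, T3 were group committed together on the master; T4 was in a later group.
t1, t2, t3, t4 = Txn("T1", 7), Txn("T2", 7), Txn("T3", 7), Txn("T4", 8)
relay_log = [t1, t2, t3, t4]

# Only T2 has queued so far: T3 is still missing, so T1 should keep waiting.
waiting_early = should_keep_waiting(t1, queued=[t2], relay_log=relay_log)
# T2 and T3 have both queued: the master's group is complete, commit now.
waiting_late = should_keep_waiting(t1, queued=[t2, t3], relay_log=relay_log)
# The I/O thread has fetched only T1: nothing known to wait for.
waiting_unfetched = should_keep_waiting(t1, queued=[], relay_log=[t1])
```

In this sketch the relay log holds only what the I/O thread has already
fetched, so the heuristic never waits for transactions it does not yet know
about. A real implementation would presumably also need an upper bound on the
wait.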

Another possibility to increase parallelism on S2 is to utilise the
--binlog-commit-wait-* options on S1 to delay commits slightly and increase
group commit batch sizes. The slave S1 is able to group commit transactions T1
and T2 together even if they were in different group commit batches on the
master, and so could not have their query execution replicated in parallel.

The --binlog-commit-wait-* options can be particularly effective on a slave,
as we have future transactions available in the relay log. So delaying commit
of T1 does not delay starting T2, unlike on the master where an application
may be waiting for T1 to commit before initiating T2. Thus, in theory, S1
could achieve very large group commit batches without any reduction in
throughput; the only visible effect would be a moderate increase in the
latency before an application sees a transaction on S1.

However, this theory fails if T2 has a row lock conflict with T1. Then T2 will
have to wait for T1 to commit. So if T1 has a high --binlog-commit-wait-usec
delay, then the slave will waste a lot of time waiting. Thus, increasing
--binlog-commit-wait-usec is currently dangerous on a slave, as depending on
the precise application load it might cause the slave to lag behind.

We could avoid this problem, as the slave in fact already has the information
that T2 is waiting for a row lock of T1. (This information is provided by
InnoDB, and is needed to break a possible deadlock if T2 would be waiting for
a later T3). So whenever we see T2 waiting for T1, we could notify the group
commit code to abort any --binlog-commit-wait-usec delay and group commit
immediately. This way, we avoid stalling replication on a high
--binlog-commit-wait-usec value, and still are able to collect large batches
of group commits when possible.
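
The proposed behaviour can be sketched as a toy model in Python. The class
GroupCommitWait and its methods are hypothetical names for illustration only;
wait_count and wait_usec stand in for the --binlog-commit-wait-count and
--binlog-commit-wait-usec settings. The point is the extra exit condition:
besides reaching the count or the timeout, the wait ends as soon as a later
transaction is reported blocked on a row lock held by a transaction already
queued in the group.

```python
class GroupCommitWait:
    """Toy model of a group commit leader's wait (all names hypothetical)."""

    def __init__(self, wait_count, wait_usec):
        self.wait_count = wait_count
        self.wait_usec = wait_usec
        self.queued = []              # transactions queued for this group
        self.elapsed_usec = 0
        self.blocked_on_queued = False

    def enqueue(self, txn):
        self.queued.append(txn)

    def tick(self, usec):
        """Advance the simulated clock."""
        self.elapsed_usec += usec

    def report_lock_wait(self, waiter, holder):
        """Called when `waiter` blocks on a row lock held by `holder`."""
        if holder in self.queued:
            self.blocked_on_queued = True

    def commit_reason(self):
        """Return why the group should commit now, or None to keep waiting."""
        if self.blocked_on_queued:
            return "conflict"
        if len(self.queued) >= self.wait_count:
            return "count"
        if self.elapsed_usec >= self.wait_usec:
            return "timeout"
        return None


w = GroupCommitWait(wait_count=20, wait_usec=100_000)
w.enqueue("T1")
w.tick(500)
before = w.commit_reason()                    # still waiting for more commits
w.report_lock_wait(waiter="T2", holder="T1")  # T2 needs a row lock held by T1
after = w.commit_reason()                     # wait is cut short
```

Without the conflict report, T2 would stay blocked until the count or the
timeout was reached; with it, the group commits after 500 microseconds of
simulated time instead of 100000.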

We could have --binlog-commit-wait-heuristics=follow_master_commit|detect_conflict.
follow_master_commit would break the wait if the group commit is as large as
the one from the master. detect_conflict would break the wait if a row lock
conflict is detected. If this is too large a change for a GA release, we could
have it off by default in 10.0 and default to detect_conflict in 10.1.

We have a user who tested and was able to get very good parallel replication
speedup on S1, but was limited on S2 due to this issue.

Environment

None

Status

Assignee

Kristian Nielsen

Reporter

Kristian Nielsen

External issue ID

None

Components

Fix versions

Affects versions

Priority

Major