
Enhanced semisync replication

Enhanced semi-synchronous replication performs COMMIT in the following way:

1. Prepare the transaction in the storage engine(s).

2. Write the transaction to the binlog, flush the binlog to disk.

3. Wait for at least one slave to acknowledge the reception of the binlog
events for the transaction.

4. Commit the transaction to the storage engine(s).

This is different from normal semi-synchronous replication, where steps (3)
and (4) are reversed.
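
As a rough illustration of the difference, the two orderings can be written
out as below. This is a sketch only: SemiSyncMode, commit_steps and the step
names are made up for this description and are not server code.

```cpp
#include <string>
#include <vector>

// Hypothetical names for illustration; not MariaDB server code.
enum class SemiSyncMode { Normal, Enhanced };

// Returns the order in which COMMIT performs its steps for the given mode.
std::vector<std::string> commit_steps(SemiSyncMode mode) {
    // Steps (1) and (2) are the same in both modes.
    std::vector<std::string> steps = {"prepare", "binlog_write_and_flush"};
    if (mode == SemiSyncMode::Enhanced) {
        // Enhanced: wait for the slave ack before the engine commit.
        steps.push_back("wait_for_slave_ack");
        steps.push_back("engine_commit");
    } else {
        // Normal: engine commit first, then wait for the slave ack.
        steps.push_back("engine_commit");
        steps.push_back("wait_for_slave_ack");
    }
    return steps;
}
```

In the enhanced ordering the transaction is not yet committed in the storage
engine while the master waits, which is why other connections cannot see it.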

This task is about implementing enhanced semi-synchronous replication in a way
that interacts well with MariaDB group commit. In Oracle MySQL, the enhanced
semi-synchronous replication would be very expensive, as the global
prepare_commit_mutex is held over the entire operation, which would seriously
limit throughput. With MariaDB, a whole group of transactions can enter each
stage in parallel, so high throughput can be maintained.

A benefit of enhanced semi-synchronous replication is that a transaction does
not become visible until at least one slave has acknowledged receiving it.
This means that if a master is completely lost, any transaction seen by
other connections will already have been received by at least one slave,
avoiding a potential phantom read issue.

For more discussion see

http://www.mysqlperformanceblog.com/2012/01/19/how-does-semisynchronous-mysql-replication-work/

Implementation in MariaDB group commit

In MariaDB group commit, a group of commits queue up while waiting for the
previous group to finish. This happens during/just after the prepare step
(1).

Once the previous group finishes, we have in step (2) a list of commits that
we write to the binary log.

To implement enhanced semi-synchronous replication, we simply add a step just
after (2) where we wait for slave acknowledgement of the last binlog position
of the entire group. We introduce a new mutex for this, so that we can release
LOCK_log before the wait and delay taking LOCK_commit_ordered until after the
wait; this allows stage (3) to run in parallel with stages (2) and (4), while
still preserving correct ordering and avoiding one stage getting ahead of the
other.

The mutexes must be chained, meaning that we must take the next lock before
releasing the previous one (otherwise one group might overtake the previous
group, causing incorrect ordering of events):

    ... stage (2) end ...
    lock LOCK_enhanced_semisync
    unlock LOCK_log
    ... stage (3) wait for slave ...
    lock LOCK_commit_ordered
    unlock LOCK_enhanced_semisync
    ... stage (4) begin ...

See the code in sql/log.cc, MYSQL_BIN_LOG::trx_group_commit_leader() for
details.
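
A minimal, self-contained sketch of this lock chaining follows. It assumes
std::mutex stand-ins for the three mutexes and one thread per group leader;
CommitPipeline, run_group and run_groups_preserve_order are hypothetical
names, not the code in sql/log.cc. The actual wait for the slave is replaced
by a placeholder.

```cpp
#include <cstddef>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-ins for LOCK_log, LOCK_enhanced_semisync and
// LOCK_commit_ordered; this is not the server implementation.
struct CommitPipeline {
    std::mutex lock_log;
    std::mutex lock_enhanced_semisync;
    std::mutex lock_commit_ordered;
    std::vector<int> binlog_order;  // appended to while holding lock_log
    std::vector<int> ack_order;     // while holding lock_enhanced_semisync
    std::vector<int> commit_order;  // while holding lock_commit_ordered

    // One group leader passing through stages (2), (3) and (4).
    void run_group(int group_id) {
        lock_log.lock();
        binlog_order.push_back(group_id);  // stage (2): write binlog
        lock_enhanced_semisync.lock();     // take next lock first ...
        lock_log.unlock();                 // ... then release previous
        ack_order.push_back(group_id);     // stage (3): placeholder for wait
        lock_commit_ordered.lock();
        lock_enhanced_semisync.unlock();
        commit_order.push_back(group_id);  // stage (4): engine commit
        lock_commit_ordered.unlock();
    }
};

// Runs n group leaders concurrently and reports whether every stage saw
// the groups in the same order as the binlog write.
bool run_groups_preserve_order(int n) {
    CommitPipeline p;
    std::vector<std::thread> leaders;
    for (int i = 0; i < n; ++i)
        leaders.emplace_back([&p, i] { p.run_group(i); });
    for (auto &t : leaders)
        t.join();
    return p.binlog_order.size() == static_cast<std::size_t>(n) &&
           p.binlog_order == p.ack_order &&
           p.ack_order == p.commit_order;
}
```

Because each leader takes the next mutex before releasing the previous one,
the three recorded orders come out identical regardless of how the threads
are scheduled, while different groups can still occupy different stages at
the same time.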

Stage (3) should be added as another kind of hook (semi-sync replication
is plugin-based using such hooks). We will use the
--rpl_semi_sync_master_wait_before_commit=1 option to enable enhanced
semi-synchronous replication, following the Google patch

http://code.google.com/p/enhanced-semi-sync-replication/

When --rpl_semi_sync_master_wait_before_commit=1 is set, the semi-sync plugin
can use the new hook instead of the current after_commit hook.

Crash scenarios

If a master crashes before a transaction T is written into the binlog, that
transaction will be rolled back during crash recovery upon server restart, as
normal.

If T was written (and synced) into binlog, but not yet acknowledged by any
slave, and master crashes, then T will be committed during crash recovery. In
this case, it is possible for a connection to see T committed on the master
before any slave has had time to connect to the master and receive it. Thus,
if we crash again right after crash recovery and completely lose the master,
it is possible for a connection to have seen T on the master while T is now
effectively missing from the system. To fix this, one option is to somehow
have the master wait after crash recovery for at least one slave to connect
and acknowledge all recovered commits, thus extending semi-sync to the crash
recovery phase. An alternative may be for the DBA to prevent connections to
the server after a crash until at least one slave has caught up
(SHOW MASTER STATUS on master and SELECT MASTER_POS_WAIT() on slave).

If T was acknowledged by at least one slave, then we know that T exists both
in the master binlog (which is synced before sending to slaves) and in the
slave relay log. Thus, when master crash recovery is done, T will be on both
master and that slave. And if we completely lose the master, T will still
eventually be applied on the slave (unless we lose both master and slave at
the same time).
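
The three master-crash cases above can be summarised as a small decision
function. This is only a sketch of the reasoning; TxnFate and
fate_after_master_crash are names invented here, and the function does not
model the second crash discussed for the unacknowledged case.

```cpp
// Hypothetical summary of where transaction T ends up after the master
// crashes and completes crash recovery; not server code.
enum class TxnFate {
    RolledBack,                // T never reached the binlog
    CommittedOnMasterOnly,     // T is in the binlog, no slave has it yet
    CommittedOnMasterAndSlave  // T is in the binlog and a slave relay log
};

TxnFate fate_after_master_crash(bool synced_to_binlog, bool acked_by_slave) {
    if (!synced_to_binlog)
        return TxnFate::RolledBack;  // rolled back during crash recovery
    if (!acked_by_slave)
        return TxnFate::CommittedOnMasterOnly;  // lost if the master is lost
    return TxnFate::CommittedOnMasterAndSlave;  // survives loss of the master
}
```

Only the middle case leaves a window in which a connection can see T on the
master while no slave has received it.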

If a slave crashes during the commit on master, nothing special should
happen, unless all connected slaves crash, leaving the master without any
slaves connected.

In this case the situation is much as with normal semisync. Commits will be
stalled until timeout. They will be stalled a bit earlier (before InnoDB
commit rather than after), so row locks will not have been released yet —
otherwise the result is much the same. I need to check if semisync is able to
detect the TCP close from all slaves and fail faster in this case — however,
this does not help for the case when power failure takes out the slave without
any notice sent on the network.

Pending XID issue

One issue that needs to be dealt with is the potential deadlock described in
this bug report (point 5):

http://bugs.mysql.com/bug.php?id=44058

The problem is that when the server wants to rotate the binlog, it takes the
LOCK_log mutex and holds it while it waits for all pending commits to
finish. But LOCK_log prevents slaves from receiving events, which prevents
slave acks, which prevents pending commits from finishing.
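
The circular wait can be made explicit with a small wait-for map. This is a
sketch for illustration only: the party names are informal labels for the
actors in the paragraph above, and WaitFor, has_wait_cycle and
binlog_rotate_waits are hypothetical helpers, not anything in the server.

```cpp
#include <map>
#include <set>
#include <string>

// Each party waits for at most one other party (illustrative labels only).
using WaitFor = std::map<std::string, std::string>;

// Follows the wait-for chain from `start`; returns true if the chain
// revisits a party, i.e. a cycle is reachable and nobody can make progress.
bool has_wait_cycle(const WaitFor &waits, const std::string &start) {
    std::set<std::string> seen;
    std::string cur = start;
    while (true) {
        if (!seen.insert(cur).second)
            return true;   // already visited: circular wait
        auto it = waits.find(cur);
        if (it == waits.end())
            return false;  // cur is not blocked on anything
        cur = it->second;
    }
}

// The waits described above for a binlog rotation under semi-sync.
WaitFor binlog_rotate_waits() {
    return {
        {"rotate (holds LOCK_log)", "pending commits"},
        {"pending commits", "slave ack"},
        {"slave ack", "slave receives events"},
        {"slave receives events", "rotate (holds LOCK_log)"},
    };
}
```

Removing any one edge, for example by not holding LOCK_log while waiting for
pending commits, breaks the cycle.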

This can be worked around, of course, as is done e.g. in the Google enhanced
semisync patch. But I do not like this work-around: it introduces even more
complication into what is already a bad design.

I would prefer to instead solve the root problem, namely that the server needs
to stall commits when rotating the binlog. This solves a number of issues. See
a description for this here:

https://mariadb.atlassian.net/browse/MDEV-181

Status