Details

      Description

      Enhanced semi-synchronous replication performs COMMIT in the following steps:

      1. Prepare the transaction in the storage engine(s).

      2. Write the transaction to the binlog and flush the binlog to disk.

      3. Wait for at least one slave to acknowledge the reception of the binlog
      events for the transaction.

      4. Commit the transaction to the storage engine(s).

      This is different from normal semi-synchronous replication, where steps (3)
      and (4) are reversed.
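
      The difference in step order can be written down as a minimal sketch. The
      step names and both function names below are hypothetical labels chosen for
      illustration; they are not identifiers from the server source.

```python
# Hypothetical step labels; they mirror steps (1)-(4) in the description.
PREPARE = "prepare in storage engine"
BINLOG = "write and flush binlog"
WAIT_ACK = "wait for slave ack"
ENGINE_COMMIT = "commit in storage engine"


def commit_steps(enhanced):
    """Return the COMMIT steps in order for the chosen semi-sync variant."""
    if enhanced:
        return [PREPARE, BINLOG, WAIT_ACK, ENGINE_COMMIT]
    # Normal semi-sync swaps the last two steps.
    return [PREPARE, BINLOG, ENGINE_COMMIT, WAIT_ACK]


def visible_before_ack(enhanced):
    """True if other connections can see the transaction before any slave ack."""
    steps = commit_steps(enhanced)
    # A transaction becomes visible once the storage engine commit is done.
    return steps.index(ENGINE_COMMIT) < steps.index(WAIT_ACK)
```

      With normal semi-sync, visible_before_ack(False) is true; with the enhanced
      variant, visible_before_ack(True) is false, which is the property this task
      is after.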

      This task is about implementing enhanced semi-synchronous replication in a way
      that interacts well with MariaDB group commit. In Oracle MySQL, enhanced
      semi-synchronous replication would be very expensive, because the global
      prepare_commit_mutex is held over the entire operation, which would seriously
      limit throughput. With MariaDB, a whole group of transactions can enter each
      stage in parallel, so high throughput can be maintained.

      A benefit of enhanced semi-synchronous replication is that a transaction does
      not become visible until at least one slave has acknowledged receiving it.
      This means that if a master is completely lost, any transaction seen by
      other connections will have been replicated somewhere, avoiding a potential
      phantom read issue.

      For more discussion see

      http://www.mysqlperformanceblog.com/2012/01/19/how-does-semisynchronous-mysql-replication-work/

      Implementation in MariaDB group commit

      In MariaDB group commit, a group of commits queue up while waiting for the
      previous group to finish. This happens during/just after the prepare step
      (1).

      Once the previous group finishes, we have in step (2) a list of commits that
      we write to the binary log.

      To implement enhanced semi-synchronous replication, we simply add a step just
      after (2) where we wait for slave acknowledgement of the last binlog position
      of the entire group. We introduce a new mutex for this, so that we can release
      LOCK_log before the wait and delay taking LOCK_commit_ordered until after the
      wait. This allows stage (3) to run in parallel with stages (2) and (4), while
      still preserving correct ordering and avoiding one stage getting ahead of the
      other.

      The mutexes must be chained, meaning that we must take the next lock before
      releasing the previous one (otherwise one group might overtake the previous
      group, causing incorrect ordering of events):

          ... stage (2) end ...
          lock LOCK_enhanced_semisync
          unlock LOCK_log
          ... stage (3) wait for slave ...
          lock LOCK_commit_ordered
          unlock LOCK_enhanced_semisync
          ... stage (4) begin ...
      

      See the code in sql/log.cc, MYSQL_BIN_LOG::trx_group_commit_leader() for
      details.
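
      The chained locking can be tried out in a small simulation. This is only a
      sketch under stated assumptions, not the code from sql/log.cc: each thread
      stands in for one group commit leader, the three locks reuse the mutex names
      from the text, and run_group() and simulate() are hypothetical helpers. The
      release of LOCK_commit_ordered after stage (4) is also an assumption of the
      sketch.

```python
import threading

# Stand-ins for the three mutexes named in the text.
LOCK_log = threading.Lock()
LOCK_enhanced_semisync = threading.Lock()
LOCK_commit_ordered = threading.Lock()


def run_group(group_id, order):
    """Walk one commit group through stages (2), (3) and (4)."""
    LOCK_log.acquire()
    order["binlog"].append(group_id)   # stage (2): write group to binlog

    LOCK_enhanced_semisync.acquire()   # take the next lock first ...
    LOCK_log.release()                 # ... then release the previous one
    order["ack"].append(group_id)      # stage (3): wait for slave ack

    LOCK_commit_ordered.acquire()
    LOCK_enhanced_semisync.release()
    order["commit"].append(group_id)   # stage (4): commit in storage engine
    LOCK_commit_ordered.release()      # assumed: released once stage (4) is done


def simulate(n_groups):
    """Run n_groups leaders concurrently; return the order seen in each stage."""
    order = {"binlog": [], "ack": [], "commit": []}
    threads = [
        threading.Thread(target=run_group, args=(g, order))
        for g in range(n_groups)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return order
```

      Whatever order the groups happen to enter stage (2) in, simulate() returns
      the same order for all three stages, because a group always holds the next
      lock before the following group can get past the previous one. Different
      groups can still be in different stages at the same time.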

      Stage (3) should be added as another kind of hook (semi-sync replication
      is plugin-based and uses such hooks). We will use the
      --rpl_semi_sync_master_wait_before_commit=1 option to enable enhanced
      semi-synchronous replication, following the Google patch:

      http://code.google.com/p/enhanced-semi-sync-replication/

      When --rpl_semi_sync_master_wait_before_commit=1 is set, the semi-sync plugin
      can use the new hook instead of the current after_commit hook.

      Crash scenarios

      If a master crashes before a transaction T is written into the binlog, that
      transaction will be rolled back during crash recovery upon server restart, as
      normal.

      If T was written (and synced) into the binlog, but not yet acknowledged by any
      slave, and the master crashes, then T will be committed during crash recovery.
      In this case, it is possible for a connection to see T committed on the master
      before any slave has had time to connect to the master and receive it. Thus,
      if we crash again right after crash recovery and completely lose the master,
      it is possible for a connection to have seen T on the master while T is now
      effectively missing from the system. To fix this, one option is to somehow
      have the master wait after crash recovery for at least one slave to connect
      and acknowledge all recovered commits, thus extending semi-sync to the crash
      recovery phase. An alternative may be for the DBA to prevent connections to
      the server after a crash until at least one slave has caught up
      (SHOW MASTER STATUS on the master and SELECT MASTER_POS_WAIT() on the slave).
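
      As a sketch of the manual procedure: the DBA reads the File and Position
      columns from SHOW MASTER STATUS on the recovered master, then runs
      MASTER_POS_WAIT() with those values on a slave. The helper below only builds
      the statement text; its name, the 60-second default timeout and the example
      binlog file name are assumptions for illustration.

```python
def master_pos_wait_sql(binlog_file, binlog_pos, timeout_seconds=60):
    """Build the statement to run on the slave.

    binlog_file and binlog_pos are the File and Position values reported by
    SHOW MASTER STATUS on the master after crash recovery.
    """
    if "'" in binlog_file or "\\" in binlog_file:
        # Keep the sketch simple: refuse names that would need escaping.
        raise ValueError("unexpected character in binlog file name")
    if not isinstance(binlog_pos, int) or binlog_pos < 0:
        raise ValueError("binlog position must be a non-negative integer")
    if not isinstance(timeout_seconds, int) or timeout_seconds <= 0:
        raise ValueError("timeout must be a positive integer")
    return "SELECT MASTER_POS_WAIT('%s', %d, %d)" % (
        binlog_file, binlog_pos, timeout_seconds)


# Example with made-up values for File and Position.
print(master_pos_wait_sql("mysql-bin.000003", 1234))
```

      MASTER_POS_WAIT() returns once the slave has applied events up to the given
      position, or -1 if the timeout expires first, so client connections to the
      master should only be allowed again after it returns a non-negative value.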

      If T was acknowledged by at least one slave, then we know that T exists both
      in the master binlog (which is synced before sending to slaves) and in the
      slave relay log. Thus, when master crash recovery is done, T will be on both
      the master and that slave. And if we completely lose the master, T will still
      eventually be applied on the slave (unless we lose both master and slave at
      the same time).

      If a slave crashes during the commit on master, nothing special should
      happen, unless all connected slaves crash, leaving the master without any
      slaves connected.

      In this case the situation is much as with normal semisync. Commits will be
      stalled until timeout. They will be stalled a bit earlier (before InnoDB
      commit rather than after), so row locks will not have been released yet —
      otherwise the result is much the same. I need to check if semisync is able to
      detect the TCP close from all slaves and fail faster in this case — however,
      this does not help for the case when power failure takes out the slave without
      any notice sent on the network.

      Pending XID issue

      One issue that needs to be dealt with is the potential deadlock described in
      this bug report (point 5):

      http://bugs.mysql.com/bug.php?id=44058

      The problem is that when the server wants to rotate the binlog, it takes the
      LOCK_log mutex and holds it while it waits for all pending commits to
      finish. But LOCK_log prevents slaves from receiving events, which prevents
      slave acks, which in turn prevents pending commits from finishing.
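
      The circular wait can be made explicit as a small waits-for graph. This is
      an illustrative sketch only; the node labels and the find_wait_cycle()
      helper are hypothetical and do not correspond to server code.

```python
def find_wait_cycle(waits_for, start):
    """Follow 'X waits for Y' edges from start.

    Returns the cycle as a list that begins and ends with the same node,
    or None if the chain ends without looping back.
    """
    path = []
    seen = {}
    node = start
    while node is not None and node not in seen:
        seen[node] = len(path)
        path.append(node)
        node = waits_for.get(node)
    if node is None:
        return None
    return path[seen[node]:] + [node]


# Each key is blocked until its value makes progress.
WAITS_FOR = {
    "binlog rotation (holds LOCK_log)": "pending commits",
    "pending commits": "slave ack",
    "slave ack": "slave receives events",
    "slave receives events": "binlog rotation (holds LOCK_log)",
}

print(find_wait_cycle(WAITS_FOR, "pending commits"))
```

      Starting from any of the four nodes leads back to the same node, so none of
      them can make progress on its own.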

      This can be worked around, of course, as is done for example in the Google
      enhanced semisync patch. But I do not like this work-around: it introduces
      even more complication into what is already a bad design.

      I would prefer to instead solve the root problem, namely that the server
      needs to stall commits when rotating the binlog. Doing so solves a number of
      issues. See the description here:

      https://mariadb.atlassian.net/browse/MDEV-181

              People

              • Assignee: Kristian Nielsen (knielsen)
              • Reporter: Rasmus Johansson (ratzpo)

                  Time Tracking

                  Estimated: 1w 3d 1h
                  Remaining: 1w 3h 30m
                  Logged:    2d 5h 30m