MariaDB Development
MDEV-232

Remove one fsync() inside engine's commit() method

    Details

    • Type: Task
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 10.0.0
    • Labels: None
    • Global Rank: 359

      Description

      Pre-requisite: MDEV-181

      1. Overview:

      When binlog is enabled and an InnoDB/XtraDB transaction is committed, we
      currently need no fewer than three calls to fsync():

      1. After the InnoDB prepare() step, to ensure that the transaction will not be
        lost in a crash after being written into the binlog.
      2. After writing the event to the binlog, to ensure that we will not commit in
        InnoDB a transaction that will subsequently be lost from binlog due to
        crash.
      3. After the InnoDB commit() step, to know that XA recovery will no longer
        need to consider the XID for this transaction during crash recovery.

      This task builds on MDEV-181 to remove the fsync() in step 3, without reducing
      any durability or consistency guarantees.

      The basic issue is that we need some way for storage engines to communicate to
      the server layer when a commit is durably flushed to disk. Once the server
      knows that all commits in a particular binlog file are durably flushed in all
      storage engines, it can write a new binlog checkpoint event, as the old file
      will then no longer be needed for crash recovery.

      Currently this communication is implicit in the commit() method of the
      handlerton: the engine guarantees that the transaction will be durably flushed
      during commit().

      But for engines that support the commit_ordered() interface, we know that
      commits will be done in the same order in the binlog and in InnoDB. So the
      server layer does not need to be informed of the flush of every transaction.
      All we need is to know when the last transaction in a binlog file is durably
      flushed to disk - then we are sure that all prior transactions are also
      flushed, and that the binlog file will no longer be needed for crash recovery,
      so we can log a new binlog checkpoint event.

      It might be possible to also handle storage engines that do not support
      commit_ordered(), but we do not consider this in this task.

      Thus, the basic idea is:

      • When the server layer decides to rotate the current binlog, it queues to
        each supporting engine a request to be informed when the last transaction
        in the binlog is durably flushed to disk.
      • When an engine has flushed to disk the last commit in a binlog, it notifies
        the server asynchronously of the fact. When all engines have so informed
        the server layer, a new binlog checkpoint event is written.

      2. Extensions to the storage engine interface

      We will add a new optional method to the storage engine handlerton:

          int (*commit_checkpoint_request)(void *cookie);
      

      This will be called after the last commit is written into the binlog. The
      engine must queue this request and later inform the server layer
      asynchronously when all transactions committed prior to the call of
      commit_checkpoint_request() are durably flushed to disk (if such flush already
      happened, the engine can inform the server immediately during the call).

      The engine informs the server layer by calling a new server function

          void commit_checkpoint_notify(handlerton, cookie);
      

      passing the same cookie value that it received in commit_checkpoint_request().

      Engines that implement this method are no longer required to fsync() to disk
      before returning from commit() of a two-phase-committed transaction (though
      such fsync() is still needed in one-phase commit where no prepare() step is
      done). The server layer will ensure that the transaction will be recovered
      from the binlog if we crash.
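
      The request/notify handshake described above can be sketched roughly as
      follows. This is only an illustration: toy_handlerton, toy_engine and
      request_checkpoint() are hypothetical names, the real handlerton has many
      more members, and the exact parameter types of commit_checkpoint_notify()
      are assumed here. The toy engine takes the "flush already happened" path
      and notifies during the call itself.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical, stripped-down stand-in for the real handlerton.
struct toy_handlerton {
  // Optional method: nullptr for engines that still fsync() in commit().
  int (*commit_checkpoint_request)(void *cookie);
};

// Stand-in for the server function. It only records that it was called;
// the parameter types are an assumption of this sketch.
int notify_calls = 0;
void *last_cookie = nullptr;

void commit_checkpoint_notify(toy_handlerton *hton, void *cookie) {
  (void)hton;
  ++notify_calls;
  last_cookie = cookie;
}

// A toy engine whose log is always already durable, so it may notify the
// server immediately from inside commit_checkpoint_request().
int toy_engine_request(void *cookie);
toy_handlerton toy_engine = {toy_engine_request};

int toy_engine_request(void *cookie) {
  // Must hand back exactly the cookie that was received.
  commit_checkpoint_notify(&toy_engine, cookie);
  return 0;
}

// Server side: ask one engine for a notification. Returns false when the
// engine does not implement the optional method.
bool request_checkpoint(toy_handlerton *hton, void *cookie) {
  assert(hton != nullptr);
  if (hton->commit_checkpoint_request == nullptr)
    return false;
  return hton->commit_checkpoint_request(cookie) == 0;
}
```

      An engine that leaves the pointer as nullptr keeps the old contract, so
      the server must continue to rely on its commit() being durable.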

      3. InnoDB extensions

      In InnoDB and XtraDB, we will implement the new commit_checkpoint_request()
      method to simply flush the InnoDB redo log and return.

      If desired, we can later refine this so that such flush is done
      asynchronously. Then commit_checkpoint_request() would just queue the flush
      request in a list - and when the redo log is next flushed (for example by the
      once-per-second background flush or a prepare() or InnoDB checkpoint), the
      list could be scanned and the appropriate entries removed from the list and
      reported as flushed.

      As commit_checkpoint_request() only happens once per binlog rotation, this is
      not likely to be performance critical.
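
      A minimal sketch of the asynchronous refinement, assuming the redo log
      can be modelled by two monotonically increasing positions. ToyRedoLog and
      all of its members are hypothetical and do not correspond to InnoDB
      internals; the notified vector stands in for calls to
      commit_checkpoint_notify().

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// One queued commit_checkpoint_request().
struct PendingRequest {
  std::uint64_t lsn;  // log position that must be durable before notifying
  void *cookie;       // opaque value to hand back to the server
};

// Hypothetical model of a redo log with deferred checkpoint notifications.
struct ToyRedoLog {
  std::uint64_t written_lsn = 0;       // highest position written so far
  std::uint64_t flushed_lsn = 0;       // highest position known durable
  std::deque<PendingRequest> pending;  // ordered by lsn, oldest first
  std::vector<void *> notified;        // stand-in for the notify calls

  // A commit writes to the log but does not flush it.
  void write_commit() { ++written_lsn; }

  // Notify at once if everything written is already durable, else queue.
  void checkpoint_request(void *cookie) {
    if (flushed_lsn >= written_lsn)
      notified.push_back(cookie);
    else
      pending.push_back({written_lsn, cookie});
  }

  // Called whenever the log is flushed for any reason (background flush,
  // prepare(), checkpoint). Reports every request that is now satisfied.
  void on_flush(std::uint64_t up_to) {
    assert(up_to <= written_lsn);
    if (up_to > flushed_lsn)
      flushed_lsn = up_to;
    while (!pending.empty() && pending.front().lsn <= flushed_lsn) {
      notified.push_back(pending.front().cookie);
      pending.pop_front();
    }
  }
};
```

      Because requests are queued in log order, only the front of the list
      needs to be inspected after each flush.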

      We will implement a new value innodb_flush_log_at_trx_commit=3. This works the
      same way as innodb_flush_log_at_trx_commit=1 in terms of durability
      guarantees. However, when binlog is enabled, it does not fsync() the log
      during commit(); it does so only during prepare() and once per second (as with
      innodb_flush_log_at_trx_commit=0), which improves performance. When binlog is
      not enabled (seen inside InnoDB as commit() called without prior prepare()),
      fsync() is still done as part of commit().

      I think innodb_flush_log_at_trx_commit=3 should be default (currently
      innodb_flush_log_at_trx_commit=1 is default), as it improves performance with
      no negative consequences. Perhaps even the values should be swapped, so that
      1 means the new behavior with no fsync() in commit, and the new 3 means the
      old behavior.
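
      The rule for when commit() itself must fsync() can be written as a small
      decision function. This is a sketch using the numbering proposed in this
      section (3 is the new value); commit_must_fsync() is a hypothetical
      helper, not an InnoDB function, and it deliberately says nothing about
      what prepare() or the background thread do.

```cpp
#include <cassert>

// Does commit() have to fsync() the redo log before returning?
//   setting     - value of innodb_flush_log_at_trx_commit (0..3)
//   had_prepare - true if prepare() ran first (two-phase commit, which is
//                 how InnoDB sees a transaction when binlog is enabled)
bool commit_must_fsync(int setting, bool had_prepare) {
  assert(setting >= 0 && setting <= 3);
  switch (setting) {
    case 1:
      return true;          // always fsync() in commit()
    case 3:
      return !had_prepare;  // skip only when the binlog can recover us
    default:
      return false;         // 0 and 2: no fsync() at commit time
  }
}
```

      With setting 3 a one-phase commit still pays for the fsync(), since no
      binlog entry exists from which the transaction could be recovered.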

      4. Server layer changes

      MDEV-181 introduces a list of pending binlog files and the count of not
      yet flushed XIDs in each file. We extend this so that the count is the sum of
      not yet flushed XIDs and outstanding commit_checkpoint_request() calls.

      When we decide to rotate the binlog, we loop over all engines, increment the
      counter for that binlog, and call commit_checkpoint_request() (for engines
      that implement it), passing a pointer to the list entry for the corresponding
      binlog file.

      Then in commit_checkpoint_notify(), we decrement the counter again. When the
      counter drops to zero, we know that all commit_checkpoint_notify() calls have
      occurred and all unlog() calls have happened, so we can log a new binlog
      checkpoint and remove the entry from the list.

      For transactions in which all participating engines implement
      commit_checkpoint_request(), we do not need unlog() at all - so we return a
      dummy cookie from log_and_order(), and do nothing in unlog(). This helps by
      removing an extra lock/unlock of the heavily contended LOCK_log inside
      unlog().

      During the actual binlog rotate, we need to ensure that the count in the
      binlog entry will not drop to zero early and cause the entry to be deleted
      before we even have time to do the first commit_checkpoint_request(). So we
      increment the counter by one extra, and decrement it after all
      commit_checkpoint_request() calls have been issued (this is mostly an
      implementation detail, the idea is best seen from the code and associated
      comments).
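
      The counting scheme, including the extra increment that guards the entry
      during rotation, might look roughly like this. BinlogEntry, ToyEngine,
      release() and rotate() are hypothetical names for illustration; locking
      and the actual checkpoint event are left out.

```cpp
#include <cassert>
#include <vector>

// One entry in the list of pending binlog files.
struct BinlogEntry {
  int pending = 0;  // unflushed XIDs + outstanding checkpoint requests
  bool checkpoint_logged = false;
};

// Models the decrement done by unlog() and commit_checkpoint_notify().
void release(BinlogEntry *entry) {
  assert(entry->pending > 0);
  if (--entry->pending == 0)
    entry->checkpoint_logged = true;  // stand-in for writing the event
}

// Toy engine that either notifies during the request or on a later flush.
struct ToyEngine {
  bool implements_request;
  bool notifies_immediately;
  std::vector<BinlogEntry *> queued;

  void commit_checkpoint_request(BinlogEntry *cookie) {
    if (notifies_immediately)
      release(cookie);
    else
      queued.push_back(cookie);
  }

  void flush() {
    for (BinlogEntry *cookie : queued)
      release(cookie);
    queued.clear();
  }
};

// Called when rotating away from old_file.
void rotate(BinlogEntry *old_file, std::vector<ToyEngine *> &engines) {
  ++old_file->pending;  // guard: keeps the count above zero in the loop
  for (ToyEngine *engine : engines) {
    if (!engine->implements_request)
      continue;
    ++old_file->pending;
    engine->commit_checkpoint_request(old_file);
  }
  release(old_file);  // drop the guard once all requests are issued
}
```

      Without the guard, an engine that notifies immediately would bring the
      count to zero before the remaining engines had been asked.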

      5. Changes to TC_LOG_MMAP

      When storage engines no longer flush transactions in commit(), TC_LOG_MMAP
      also needs to be updated to work with this. TC_LOG_MMAP is not used much -
      only when binlog is disabled, and multiple XA-capable engines take part in the
      same transactions. This currently means PBXT and InnoDB with binlog disabled.
      But it still needs to work.

      TC_LOG_MMAP::unlog() needs to check if any engine supporting
      commit_checkpoint_request() participated in the transaction. If not, it can
      proceed to delete the xid from the mmap()ed page. But if there are any, it
      must instead put the location of the xid in the page into an in-memory list,
      to be deleted later.

      Then periodically (say every N XIDs, where N is the number of XIDs in one page),
      unlog() will call commit_checkpoint_request() on each supporting engine to
      request a notification, and remember this request in a separate in-memory
      list. And when every engine has replied with commit_checkpoint_notify(), the
      corresponding part of the list of XID locations can then be deleted.
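
      The deferred deletion can be sketched as two in-memory queues: one of XID
      locations and one of outstanding request batches. ToyTcLog is a
      hypothetical model, not the real TC_LOG_MMAP; slots are plain integers,
      the cookie is a batch number, and the caller is assumed to pass that
      cookie to commit_checkpoint_request() on every supporting engine.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <set>

class ToyTcLog {
 public:
  ToyTcLog(std::size_t batch_size, int engines)
      : batch_size_(batch_size), engines_(engines) {
    assert(batch_size > 0 && engines > 0);
  }

  // Record an XID in the (modelled) mmap()ed page.
  void log(int slot) { live_.insert(slot); }

  // Returns the cookie of a newly opened batch, or -1 if none was opened.
  long unlog(int slot, bool deferred_engine_involved) {
    if (!deferred_engine_involved) {
      live_.erase(slot);  // every engine was durable in commit()
      return -1;
    }
    deferred_.push_back(slot);
    if (deferred_.size() - batched_ < batch_size_)
      return -1;
    batches_.push_back(Batch{batch_size_, engines_});
    batched_ += batch_size_;
    return first_id_ + static_cast<long>(batches_.size()) - 1;
  }

  // One engine replied for the batch identified by cookie.
  void notify(long cookie) {
    assert(cookie >= first_id_);
    std::size_t idx = static_cast<std::size_t>(cookie - first_id_);
    assert(idx < batches_.size() && batches_[idx].replies_pending > 0);
    --batches_[idx].replies_pending;
    // Delete slots of completed batches, oldest first.
    while (!batches_.empty() && batches_.front().replies_pending == 0) {
      for (std::size_t i = 0; i < batches_.front().slots; ++i) {
        live_.erase(deferred_.front());
        deferred_.pop_front();
      }
      batched_ -= batches_.front().slots;
      batches_.pop_front();
      ++first_id_;
    }
  }

  bool is_live(int slot) const { return live_.count(slot) != 0; }

 private:
  struct Batch {
    std::size_t slots;    // number of deferred slots this batch covers
    int replies_pending;  // engines that have not yet notified
  };
  std::size_t batch_size_;
  int engines_;
  std::set<int> live_;         // XIDs still present in the page
  std::deque<int> deferred_;   // slots waiting for engine flushes
  std::deque<Batch> batches_;  // outstanding requests, oldest first
  std::size_t batched_ = 0;    // deferred slots already covered by a batch
  long first_id_ = 0;          // cookie of batches_.front()
};
```

      Slots are only removed once every engine has answered for their batch,
      so an XID stays visible to crash recovery until all commits are durable.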

      6. User-level documentation:

      MariaDB X.Y introduces a performance improvement for group commit for
      InnoDB/XtraDB transactions when the binary log is enabled. When
      --innodb-flush-log-at-trx-commit=1 (the default) and binlog is enabled, there
      is now one fewer sync to disk inside InnoDB during commit (2 syncs shared
      between a group of transactions instead of 3).

      Durability of commits is not decreased - this is because even if the server
      crashes before the commit is written to disk by InnoDB, it will be recovered
      from the binlog at next server startup (and it is guaranteed that sufficient
      information is synced to disk so that such recovery is always possible).

      The old behaviour, with 3 syncs to disk per (group) commit and consequently
      lower performance, can be selected with the new
      --innodb-flush-log-at-trx-commit=3 value. There is normally no benefit from
      this, however there are a couple of edge cases to be aware of:

      • If using --innodb-flush-log-at-trx-commit=1 and --log-bin but
        --sync-binlog=0, then commits are not guaranteed durable inside
        InnoDB/XtraDB after commit. This is because events can be lost from the
        binlog in case of crash with --sync-binlog=0. In this case
        --innodb-flush-log-at-trx-commit=3 can be used to get durable commits in
        InnoDB/XtraDB; however, one should be aware that a crash is nevertheless
        likely to cause commits to be lost in the binlog, leaving binlog and InnoDB
        inconsistent with each other. Thus --sync-binlog=1 is recommended; it has
        much less penalty in MariaDB 5.3 and later compared to older MariaDB and
        MySQL.
      • An XtraBackup only sees commits that have been flushed to the redo log, so
        with the new optimisation there may be a small delay (normally at most 1
        second) between when a commit happens and when the commit will be included
        in an XtraBackup. Note that the XtraDB backup will still be fully
        consistent with itself and the binlog. This is normally not an issue, as a
        backup usually takes many seconds, and includes all transactions committed
        up to the end of the backup, so it will be rather random anyway exactly
        which commit is or is not included in the backup. It is just mentioned here
        for completeness.

          Activity

          Kristian Nielsen added a comment -

          re-assigning for review, hope that's ok

          Kristian Nielsen added a comment -

          Prepare patch, merge up to 10.0-base, push

            People

            • Assignee: Kristian Nielsen
            • Reporter: Kristian Nielsen
            • Votes: 0
            • Watchers: 3


                Time Tracking

                • Original Estimate: 1w 1d 2h
                • Remaining Estimate: 0m
                • Time Spent: 3w 6h 30m