Remove one fsync() inside engine's commit() method

Pre-requisite: MDEV-181

1. Overview:

When binlog is enabled and an InnoDB/XtraDB transaction is committed, we
currently need no fewer than three calls to fsync():

  1. After the InnoDB prepare() step, to ensure that the transaction will not be
    lost in a crash after being written into the binlog.
  2. After writing the event to the binlog, to ensure that we will not commit in
    InnoDB a transaction that will subsequently be lost from binlog due to
    crash.
  3. After the InnoDB commit() step, to know that XA recovery will no longer
    need to consider the XID for this transaction during crash recovery.

This task builds on MDEV-181 to remove the fsync() in step 3, without reducing
any durability or consistency guarantees.

The basic issue is that we need some way for storage engines to communicate to
the server layer when a commit is durably flushed to disk. Once the server
knows that all commits in a particular binlog file are durably flushed in all
storage engines, it can write a new binlog checkpoint event, as the old file
will then no longer be needed for crash recovery.

Currently this communication is implicit in the commit() method of the
handlerton: the engine guarantees that the transaction will be durably flushed
during commit().

But for engines that support the commit_ordered() interface, we know that
commits will be done in the same order in the binlog and in InnoDB. So the
server layer does not need to be informed of the flush of every transaction.
All we need is to know when the last transaction in a binlog file is durably
flushed to disk - then we are sure that all prior transactions are also
flushed, and that the binlog file will no longer be needed for crash recovery,
so we can log a new binlog checkpoint event.

It might be possible to also handle storage engines that do not support
commit_ordered(), but we do not consider this in this task.

Thus, the basic idea is:

  • When the server layer decides to rotate the current binlog, it queues to
    each supporting engine a request to be informed when the last transaction
    in the binlog is durably flushed to disk.
  • When an engine has flushed to disk the last commit in a binlog, it notifies
    the server asynchronously of the fact. When all engines have so informed
    the server layer, a new binlog checkpoint event is written.

2. Extensions to the storage engine interface

We will add a new optional method to the storage engine handlerton:

    int (*commit_checkpoint_request)(void *cookie);

This will be called after the last commit is written into the binlog. The
engine must queue this request and later inform the server layer
asynchronously when all transactions committed prior to the call of
commit_checkpoint_request() are durably flushed to disk (if such flush already
happened, the engine can inform the server immediately during the call).

The engine informs the server layer by calling a new server function

    void commit_checkpoint_notify(handlerton, cookie);

passing the same cookie value that it received in commit_checkpoint_request().

Engines that implement this method are no longer required to fsync() to disk
before returning from commit() of a two-phase-committed transaction (though
such fsync() is still needed in one-phase commit where no prepare() step is
done). The server layer will ensure that the transaction will be recovered
from the binlog if we crash.
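
As an illustration of the request/notify handshake described above, here is a
minimal sketch. It is not the actual server code: handlerton_sketch,
toy_engine, toy_flush_log() and notified_cookies are hypothetical names made
up for this example, and the real handlerton has many more members. The toy
engine takes the simplest route allowed by the interface: it flushes
synchronously and notifies the server during the call.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, heavily simplified stand-in for the real handlerton.
struct handlerton_sketch {
  // Optional method: left null by engines that do not implement it.
  int (*commit_checkpoint_request)(void *cookie);
};

// Illustration only: cookies for which the server has been notified.
static std::vector<void *> notified_cookies;

// Stand-in for the proposed server function commit_checkpoint_notify().
void commit_checkpoint_notify(handlerton_sketch *hton, void *cookie) {
  assert(hton != nullptr);
  notified_cookies.push_back(cookie);
}

// Hypothetical engine-side log flush; here it only counts calls.
static int toy_log_flushes = 0;
static void toy_flush_log() { ++toy_log_flushes; }

static int toy_commit_checkpoint_request(void *cookie);
static handlerton_sketch toy_engine = {toy_commit_checkpoint_request};

// Engine implementation: make all prior commits durable, then hand the
// same cookie back to the server.
static int toy_commit_checkpoint_request(void *cookie) {
  toy_flush_log();
  commit_checkpoint_notify(&toy_engine, cookie);
  return 0;
}
```

The cookie is opaque to the engine; it only has to pass back the value it
received.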

3. InnoDB extensions

In InnoDB and XtraDB, we will implement the new commit_checkpoint_request()
method to simply flush the InnoDB redo log and return.

If desired, we can later refine this so that such flush is done
asynchronously. Then commit_checkpoint_request() would just queue the flush
request in a list - and when the redo log is next flushed (for example by the
once-per-second background flush or a prepare() or InnoDB checkpoint), the
list could be scanned and the appropriate entries removed from the list and
reported as flushed.
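
The asynchronous refinement could look roughly like the following sketch. It
assumes the engine tracks a log sequence number for "written so far" and
"durably flushed so far"; the type checkpoint_queue, its members, and the use
of a plain vector to record reported cookies are all hypothetical and only
serve to show the list scan, not how InnoDB would actually structure it.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

// One queued request: report 'cookie' once the log is flushed up to 'lsn'.
struct pending_request {
  uint64_t lsn;
  void *cookie;
};

// Hypothetical engine-side bookkeeping for asynchronous checkpoint requests.
struct checkpoint_queue {
  uint64_t current_lsn = 0;             // end of log written so far
  uint64_t flushed_lsn = 0;             // log is durable up to this point
  std::deque<pending_request> pending;  // ordered by lsn
  std::vector<void *> reported;         // stands in for notifying the server

  // Called from commit_checkpoint_request(): queue, or report at once if
  // everything committed so far is already durable.
  void request(void *cookie) {
    if (flushed_lsn >= current_lsn) {
      reported.push_back(cookie);
      return;
    }
    pending.push_back({current_lsn, cookie});
  }

  // Called whenever the redo log has been flushed for any reason.
  void on_log_flushed(uint64_t new_flushed_lsn) {
    if (new_flushed_lsn > flushed_lsn)
      flushed_lsn = new_flushed_lsn;
    while (!pending.empty() && pending.front().lsn <= flushed_lsn) {
      reported.push_back(pending.front().cookie);
      pending.pop_front();
    }
  }
};
```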

As commit_checkpoint_request() only happens once per binlog rotation, this is
not likely to be performance critical.

We will implement a new value innodb_flush_log_at_trx_commit=3. This works the
same way as innodb_flush_log_at_trx_commit=1 in terms of durability
guarantees. However, when binlog is enabled, it does not fsync() the log
during commit(), only during prepare() and once-per-second like
innodb_flush_log_at_trx_commit=0, improving performance. When binlog is not
enabled (seen inside InnoDB as commit() called without prior prepare()), then
fsync() is still done as part of commit().
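
The rule for when commit() still has to fsync() under the proposed setting can
be summarised in a small decision function. This is a sketch of the rule as
described here, not InnoDB code; commit_needs_fsync() and its parameters are
hypothetical names.

```cpp
#include <cassert>

// Returns true if commit() must fsync() the redo log before returning.
//   flush_log_at_trx_commit: value of innodb_flush_log_at_trx_commit (0..3)
//   was_prepared: true if prepare() ran first, i.e. a two-phase commit,
//                 which is how InnoDB sees "binlog is enabled"
bool commit_needs_fsync(int flush_log_at_trx_commit, bool was_prepared) {
  assert(flush_log_at_trx_commit >= 0 && flush_log_at_trx_commit <= 3);
  switch (flush_log_at_trx_commit) {
  case 1:
    return true;           // existing behaviour: always fsync() in commit()
  case 3:
    return !was_prepared;  // proposed: skip only after a prior prepare()
  default:
    return false;          // 0 and 2: no fsync() as part of commit()
  }
}
```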

I think innodb_flush_log_at_trx_commit=3 should be default (currently
innodb_flush_log_at_trx_commit=1 is default), as it improves performance with
no negative consequences. Perhaps even the values should be swapped, so that
1 means the new behavior with no fsync() in commit, and the new 3 means the
old behavior.

4. Server layer changes

MDEV-181 introduces a list of pending binlog files and the count of not
yet flushed XIDs in each file. We extend this so that the count is the sum of
not yet flushed XIDs and outstanding commit_checkpoint_request() calls.

When we decide to rotate the binlog, we loop over all engines, increment the
counter for that binlog, and call commit_checkpoint_request() (for engines
that implement it), passing a pointer to the list entry for the corresponding
binlog file.

Then in commit_checkpoint_notify(), we decrement the counter again. When the
counter drops to zero, we know that all commit_checkpoint_notify() calls have
occurred and all unlog() calls have happened, so we can log a new binlog
checkpoint and remove the entry from the list.

For transactions in which all participating engines implement
commit_checkpoint_request(), we do not need unlog() at all - so we return a
dummy cookie from log_and_order(), and do nothing in unlog(). This helps by
removing an extra lock/unlock of the heavily contended LOCK_log inside
unlog().

During the actual binlog rotate, we need to ensure that the count in the
binlog entry will not drop to zero early and cause the entry to be deleted
before we even have time to do the first commit_checkpoint_request(). So we
increment the counter by one extra, and decrement it after all
commit_checkpoint_request() calls have been issued (this is mostly an
implementation detail; the idea is best seen from the code and associated
comments).
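
The counting scheme, including the extra guard increment during rotation, can
be sketched as follows. This is an illustration under simplifying assumptions
(single-threaded, no locking); binlog_entry, engine_sketch, release_one() and
on_rotate() are hypothetical names, and queuing the cookie in a vector stands
in for the real call to commit_checkpoint_request().

```cpp
#include <cassert>
#include <vector>

// Hypothetical entry in the list of pending binlog files.
struct binlog_entry {
  int pending = 0;                 // unflushed XIDs + outstanding requests
  bool checkpoint_logged = false;  // set when a new checkpoint may be logged
};

// Hypothetical engine: remembers the cookies it has been asked to report.
struct engine_sketch {
  bool supports_checkpoint_request;
  std::vector<binlog_entry *> queued;
};

// Shared by unlog() and commit_checkpoint_notify(): drop one reference and
// log the binlog checkpoint when the count reaches zero.
void release_one(binlog_entry *e) {
  assert(e->pending > 0);
  if (--e->pending == 0)
    e->checkpoint_logged = true;
}

// Called when the server rotates away from 'old_file'.
void on_rotate(binlog_entry *old_file, std::vector<engine_sketch> &engines) {
  ++old_file->pending;  // guard: the count cannot hit zero inside the loop
  for (auto &eng : engines) {
    if (!eng.supports_checkpoint_request)
      continue;
    ++old_file->pending;
    eng.queued.push_back(old_file);  // stands in for commit_checkpoint_request()
  }
  release_one(old_file);  // drop the guard once all requests are issued
}
```

Without the guard, an engine that notifies immediately during the request
could bring the count to zero while later engines have not yet been asked.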

5. Changes to TC_LOG_MMAP

When storage engines no longer flush transactions in commit(), TC_LOG_MMAP
also needs to be updated to work with this. TC_LOG_MMAP is not used much -
only when binlog is disabled, and multiple XA-capable engines take part in the
same transactions. This currently means PBXT and InnoDB with binlog disabled.
But it still needs to work.

TC_LOG_MMAP::unlog() needs to check if any engine supporting
commit_checkpoint_request() participated in the transaction. If not, it can
proceed to delete the xid from the mmap()ed page. But if there are any, it
must instead put the location of the xid in the page into an in-memory list,
to be deleted later.

Then periodically (say every N XIDs, where N is the number of XIDs in one page),
unlog() will call commit_checkpoint_request() on each supporting engine to
request a notification, and remember this request in a separate in-memory
list. And when every engine has replied with commit_checkpoint_notify(), the
corresponding part of the list of XID locations can then be deleted.
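
A rough sketch of this deferred deletion follows. It is not TC_LOG_MMAP code:
tc_log_sketch and its members are hypothetical, a std::set stands in for the
mmap()ed page, XIDs stand in for their locations in the page, and the actual
calls to commit_checkpoint_request() on each engine are only indicated by a
comment.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <set>
#include <utility>
#include <vector>

struct tc_log_sketch {
  // One outstanding notification request covering a batch of XIDs.
  struct batch {
    std::vector<unsigned long> xids;
    int outstanding;  // engines that have not yet notified
  };

  std::size_t batch_size = 2;            // "N" in the text
  int n_engines = 1;                     // engines supporting the new method
  std::set<unsigned long> page;          // XIDs still recorded in the page
  std::vector<unsigned long> deferred;   // unlogged, but not yet deletable
  std::list<batch> requests;             // std::list keeps pointers stable

  void log_xid(unsigned long xid) { page.insert(xid); }

  void unlog(unsigned long xid, bool checkpoint_engine_participated) {
    if (!checkpoint_engine_participated) {
      page.erase(xid);  // old behaviour: engines flushed during commit()
      return;
    }
    deferred.push_back(xid);
    if (deferred.size() >= batch_size) {
      batch b{std::move(deferred), n_engines};
      deferred.clear();
      requests.push_back(std::move(b));
      // Here each supporting engine would get commit_checkpoint_request()
      // with &requests.back() as the cookie.
    }
  }

  // Stand-in for commit_checkpoint_notify() arriving from one engine.
  void notify(batch *b) {
    if (--b->outstanding > 0)
      return;
    for (unsigned long xid : b->xids)
      page.erase(xid);
    requests.remove_if([b](const batch &r) { return &r == b; });
  }
};
```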

6. User-level documentation:

MariaDB X.Y introduces a performance improvement for group commit for
InnoDB/XtraDB transactions when the binary log is enabled. When
--innodb-flush-log-at-trx-commit=1 (the default) and binlog is enabled, there
is now one fewer sync to disk inside InnoDB during commit (2 syncs shared
between a group of transactions instead of 3).

Durability of commits is not decreased - this is because even if the server
crashes before the commit is written to disk by InnoDB, it will be recovered
from the binlog at next server startup (and it is guaranteed that sufficient
information is synced to disk so that such recovery is always possible).

The old behaviour, with 3 syncs to disk per (group) commit and consequently
lower performance, can be selected with the new
--innodb-flush-log-at-trx-commit=3 value. There is normally no benefit from
this; however, there are a couple of edge cases to be aware of:

  • If using --innodb-flush-log-at-trx-commit=1 and --log-bin but --sync-binlog=0,
    then commits are not guaranteed durable inside InnoDB/XtraDB after
    commit. This is because events can be lost from the binlog in case of crash
    with --sync-binlog=0. In this case --innodb-flush-log-at-trx-commit=3 can
    be used to get durable commits in InnoDB/XtraDB, however one should be aware
    that a crash is nevertheless likely to cause commits to be lost in the
    binlog, leaving binlog and InnoDB inconsistent with each
    other. Thus --sync-binlog=1 is recommended; it carries a much smaller penalty in
    MariaDB 5.3 and later than in older MariaDB and MySQL versions.
  • XtraBackup only sees commits that have been flushed to the redo log, so
    with the new optimisation there may be a small delay (normally at most 1
    second) between when a commit happens and when the commit will be included
    in an XtraBackup backup. Note that the backup will still be fully
    consistent with itself and the binlog. This is normally not an issue, as a
    backup usually takes many seconds, and includes all transactions committed
    up to the end of the backup, so it will be rather random anyway exactly
    which commit is or is not included in the backup. It is just mentioned here
    for completeness.
