XID crash recovery across binlog boundaries


Pre-requisite: MDEV-225.

1. Overview

When the binlog is enabled, the server performs crash recovery after a crash
to ensure that the binlog is consistent with what is committed in InnoDB (and
other XA-capable storage engines).

The server queries each storage engine for the XIDs of transactions that were
prepared but not committed prior to the crash. If an XID also exists in the
binlog, the transaction is committed; otherwise it is rolled back.
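
The decision rule above can be sketched in a few lines. This is only an
illustration, not server code: the function name and the use of plain integers
as XIDs are assumptions made for the example.

```python
def resolve_prepared_xids(prepared_xids, binlog_xids):
    """Split the XIDs an engine reports as prepared into (commit, rollback).

    prepared_xids: XIDs prepared but not committed in the engine.
    binlog_xids:   XIDs found while scanning the binlog.
    """
    binlog_xids = set(binlog_xids)
    # An XID present in the binlog was logged before the crash: commit it.
    to_commit = [xid for xid in prepared_xids if xid in binlog_xids]
    # An XID absent from the binlog never reached the binlog: roll it back.
    to_rollback = [xid for xid in prepared_xids if xid not in binlog_xids]
    return to_commit, to_rollback
```

For example, with prepared XIDs 1, 2, 3 and binlog XIDs 2, 3, 9, transactions
2 and 3 are committed and transaction 1 is rolled back.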

To know if a given XID exists in the binlog or not, the server needs to know
from which binlog file to scan for such XIDs - as scanning all files could be
overly expensive. We will call this a "binlog checkpoint" - a binlog file such
that scanning from that file and until the end is guaranteed to find any XID
that is prepared, but not committed, in a storage engine.

Crash recovery can thus start from the last known binlog checkpoint.

Currently the last known binlog checkpoint is always the last binlog
file. This makes recovery easy (just scan the last binlog file). But it has
some disadvantages:

A. The server does an extra fsync() for every InnoDB commit (in addition to
the fsync() for InnoDB prepare and for binlog) to be sure that the XID is
fully committed before binlog is rotated.

B. The server needs to do extra locking around rotation of binlog - to ensure
that every pending prepared transaction is fully committed before going to
the next binlog file. This can cause some server stalls, and also creates
complications for some other features that would be nice to implement later
(see eg. http://bugs.mysql.com/bug.php?id=44058 and

This task is about improving how the binlog checkpoint is determined for
recovery, to eliminate these disadvantages.

2. The binlog checkpoint event

Basically, we will allow the latest binlog checkpoint to be any binlog file,
not just the last one. So recovery will sometimes have to scan more than one
binlog file. Most of the time, though, only the last binlog file will be
needed, since e.g. InnoDB flushes its logs at least once every second anyway.

When every XA commit in a binlog file has been flushed to disk in every
engine, the binlog file is no longer needed for XA recovery. We will then
write into the current binlog file a new binlog checkpoint event; this event
contains the new binlog checkpoint - ie. the name of the first binlog file
containing XIDs that have not yet been flushed to disk in all engines. This
will usually be the file following the binlog file that had all commits
flushed - in fact it will almost always be the current binlog file, unless
commit flush is heavily delayed/out-of-order or binlog file size is very

Additionally, we will write a binlog checkpoint event at the beginning of
every new binlog file created.

Then we will extend XA crash recovery. We start as usual, by scanning the
last binlog file found for XIDs. When we finish scanning, we check the last
binlog checkpoint seen. If this is the file just scanned, we are done - we can
proceed to recover XIDs in each engine. But if the last binlog checkpoint is
an older binlog, then we go back and scan that one also before recovering XIDs
(as well as any following that, though usually there will be at most two to

If we do not find any binlog checkpoint event in the last binlog during crash
recovery, then this means that the binlog was written by an old server, which
always has the most recent binlog as binlog checkpoint, so we can proceed with
XID recovery immediately.
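
The extended scan can be sketched as follows. This is a minimal model under
stated assumptions, not the server implementation: each binlog is represented
as a (name, events) pair, events are ("xid", n) or ("checkpoint", file_name)
tuples, and the function name is hypothetical.

```python
def collect_recovery_xids(binlogs):
    """Return the set of XIDs that recovery must consider.

    binlogs: list of (name, events) pairs, oldest file first.
    """
    if not binlogs:
        return set()

    last_name, last_events = binlogs[-1]
    xids = set()
    checkpoint = None
    # Step 1: scan the last binlog file, remembering the last checkpoint seen.
    for kind, value in last_events:
        if kind == "xid":
            xids.add(value)
        elif kind == "checkpoint":
            checkpoint = value

    # No checkpoint event: the file was written by an old server, whose
    # checkpoint is always the most recent binlog. Same result if the
    # checkpoint names the file just scanned.
    if checkpoint is None or checkpoint == last_name:
        return xids

    # Step 2: the checkpoint is an older file. Scan from that file up to
    # (but not including) the last file, which was already scanned.
    names = [name for name, _ in binlogs]
    start = names.index(checkpoint)  # raises ValueError if the file is gone
    for _, events in binlogs[start:-1]:
        for kind, value in events:
            if kind == "xid":
                xids.add(value)
    return xids
```

In this sketch a checkpoint that names a missing file raises an error; how the
server should react in that situation is not covered here.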

The new binlog checkpoint event will be marked informational. This allows
slaves that do not know the event (such as MySQL 5.6) to safely ignore the
event without halting replication. Older slaves (prior to 5.6) do not support
informational events - for those, the binlog checkpoint event will be
converted to a dummy event that allows such older slaves to proceed without
breaking replication.

3. Determining when a new binlog checkpoint is reached

Whenever a new binlog file is created, we will link it into a list maintained
in-memory. In this list, we maintain a count of XIDs not yet fully committed
in each binlog file, to replace the global counter used prior to this task.

When we (group) commit one or more XIDs to the binlog, we increase the counter
for the corresponding binlog file, and return an identifier for that binlog as
the cookie from log_and_order(). Then in unlog() we subtract one from the same
counter. When the counter reaches zero in unlog() for a binlog that is not the
newest one actively being written, we write out a new binlog checkpoint event
and remove the now fully flushed binlog file from the in-memory list.

Note that if we somehow manage to flush all commits in binlog (N-1) before the
last commit in binlog (N-2), we may need to remove two (or more) binlog file
entries at a time and go directly from binlog checkpoint (N-2) to N.
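
The bookkeeping described in this section can be modelled as below. This is a
simplified single-threaded sketch with no locking; the class name, the use of
the file name as the cookie, and the checkpoint_events list (standing in for
events written to the binlog) are assumptions for illustration. Dropping
already-flushed files when a new file is created is also an assumption of the
sketch.

```python
from collections import OrderedDict


class BinlogXidTracker:
    def __init__(self):
        # binlog name -> count of XIDs not yet fully committed, oldest first
        self.pending = OrderedDict()
        # (file the event is written into, checkpoint file name)
        self.checkpoint_events = []

    def new_binlog(self, name):
        # Link the new file into the list and write a checkpoint event at
        # the beginning of it.
        self.pending[name] = 0
        self._drop_flushed()
        self._write_checkpoint()

    def log_and_order(self, n_xids=1):
        # A (group) commit adds n_xids to the current file; the cookie
        # identifies that file.
        current = next(reversed(self.pending))
        self.pending[current] += n_xids
        return current

    def unlog(self, cookie):
        self.pending[cookie] -= 1
        current = next(reversed(self.pending))
        if self.pending[cookie] == 0 and cookie != current:
            if self._drop_flushed():
                self._write_checkpoint()

    def _drop_flushed(self):
        # Remove fully flushed files from the old end of the list, never
        # removing the file currently being written.
        dropped = False
        current = next(reversed(self.pending))
        while True:
            oldest = next(iter(self.pending))
            if oldest == current or self.pending[oldest] > 0:
                return dropped
            del self.pending[oldest]
            dropped = True

    def _write_checkpoint(self):
        current = next(reversed(self.pending))
        oldest = next(iter(self.pending))
        self.checkpoint_events.append((current, oldest))
```

If all commits in an intermediate file are unlogged before the last commit in
an older file, no checkpoint is written at that point; once the older file
reaches zero, both entries are removed together and a single checkpoint event
names the newest file.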

4. Binlog rotation

With the above in place, we can now remove the extra locking needed when
rotating the binlog, eliminating disadvantage (B).

In MYSQL_BIN_LOG::new_file_impl(), we no longer need to wait for
prepared_xids() to drop to zero - we can always rotate immediately without
waiting.

This way, we eliminate the stalls where the binlog rotate is waiting for XIDs
to be unlogged while it is holding the expensive LOCK_log. This also removes
the need to work around the potential deadlock of MySQL Bug#44058 when
implementing MDEV-162.

Binlog purge must be extended so it will not purge any log that still has
outstanding XIDs (log_in_use()).
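
One way to picture the purge restriction is shown below. This is a hedged
sketch only: the function name and arguments are hypothetical, and the choice
to stop at the first file with outstanding XIDs (so that only a contiguous
prefix of old files is purged) is an assumption, not something the task text
specifies.

```python
def purgeable_binlogs(binlog_names, purge_before, pending_xids):
    """Return the files older than purge_before that may be purged.

    binlog_names: all binlog file names, oldest first.
    purge_before: first file that must be kept.
    pending_xids: mapping of file name -> count of outstanding XIDs.
    """
    purgeable = []
    for name in binlog_names:
        # Stop at the requested limit, or at the first file that still has
        # outstanding XIDs and may therefore be needed for XA recovery.
        if name == purge_before or pending_xids.get(name, 0) > 0:
            break
        purgeable.append(name)
    return purgeable
```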

To eliminate disadvantage (A), the extra fsync() for every storage engine
commit, see MDEV-232.

5. Testing

To test, we will need to add some crash recovery testing, which is in any case
rather lacking in the current test suite. We will test crashing with the
latest binlog checkpoint at different positions: in the latest binlog, or 1 or
2 binlogs back (using a small binlog size). We will test with different
numbers of commits in different stages: not prepared / prepared / written into
binlog / committed-not-durable / committed. And we will test repeated
recovery, where we crash again at different places while doing crash
recovery. Such testing can be done by combining debug_sync (to ensure
different commits are progressed to the desired stage) with dbug crash


Kristian Nielsen

