MariaDB Development / MDEV-181

XID crash recovery across binlog boundaries

    Details

    • Type: Task
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Fix Version/s: 10.0.0
    • Global Rank: 274

      Description

      Pre-requisite: MDEV-225.

      1. Overview

      When the binlog is enabled, the server performs crash recovery after a crash
      to ensure that the binlog is consistent with what is committed in InnoDB (and
      in any other XA-capable storage engine).

      The server queries the storage engine for the XIDs of transactions that were
      prepared but not committed prior to the crash. If an XID also exists in the
      binlog, the transaction is committed, else it is rolled back.
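
      The commit-or-rollback rule above can be illustrated with a minimal sketch.
      The function name and the use of plain integers as XIDs are assumptions made
      for illustration; this is not the server's actual recovery code.

```python
def resolve_prepared(prepared_xids, binlog_xids):
    """Decide the fate of each XID the engine reports as prepared.

    Hypothetical helper: an XID found in the binlog is committed,
    any other prepared XID is rolled back.
    """
    in_binlog = set(binlog_xids)
    return {
        xid: ("commit" if xid in in_binlog else "rollback")
        for xid in prepared_xids
    }


# Engine reports three prepared transactions; the binlog contains two of them.
decisions = resolve_prepared([101, 102, 103], [100, 101, 103])
```

      Here XID 102 never reached the binlog, so it is the only one rolled back.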

      To know if a given XID exists in the binlog or not, the server needs to know
      from which binlog file to scan for such XIDs - as scanning all files could be
      overly expensive. We will call this a "binlog checkpoint" - a binlog file such
      that scanning from that file and until the end is guaranteed to find any XID
      that is prepared, but not committed, in a storage engine.

      Crash recovery can thus start from the last known binlog checkpoint.

      Currently the last known binlog checkpoint is always the last binlog
      file. This makes recovery easy (just scan the last binlog file). But it has
      some disadvantages:

      A. The server does an extra fsync() for every InnoDB commit (in addition to
      the fsync() for InnoDB prepare and for binlog) to be sure that the XID is
      fully committed before binlog is rotated.

      B. The server needs to do extra locking around rotation of binlog - to ensure
      that every pending prepared transaction is fully committed before going to
      the next binlog file. This can cause some server stalls, and also creates
      complications for some other features that would be nice to implement later
      (see eg. http://bugs.mysql.com/bug.php?id=44058 and
      http://askmonty.org/worklog/Server-RawIdeaBin/?tid=164).

      This task is about improving how the binlog checkpoint is determined for
      recovery, to eliminate these disadvantages.

      2. The binlog checkpoint event

      Basically, we will allow the latest binlog checkpoint to be any binlog file,
      not just the last one, so recovery will sometimes have to scan more than one
      binlog. Most of the time, though, only the last binlog file will be needed,
      since e.g. InnoDB flushes its logs at least once every second anyway.

      When every XA commit in a binlog file has been flushed to disk in every
      engine, the binlog file is no longer needed for XA recovery. We will then
      write into the current binlog file a new binlog checkpoint event; this event
      contains the new binlog checkpoint - ie. the name of the first binlog file
      containing XIDs that have not yet been flushed to disk in all engines. This
      will usually be the file following the binlog file that had all commits
      flushed - in fact it will almost always be the current binlog file, unless
      commit flush is heavily delayed/out-of-order or the binlog file size is very
      small.

      Additionally, we will write a binlog checkpoint event at the beginning of
      every new binlog file created.

      Then we will extend XA crash recovery. We start as usual, by scanning the
      last binlog file found for XIDs. When we finish scanning, we check the last
      binlog checkpoint seen. If this is the file just scanned, we are done - we can
      proceed to recover XIDs in each engine. But if the last binlog checkpoint is
      an older binlog, then we go back and scan that one as well before recovering
      XIDs (along with any binlog files between it and the last one, though usually
      there will be at most two to scan).

      If we do not find any binlog checkpoint event in the last binlog during crash
      recovery, then this means that the binlog was written by an old server, which
      always has the most recent binlog as binlog checkpoint, so we can proceed with
      XID recovery immediately.
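
      The extended scan can be sketched as follows. This is a simplified model
      under stated assumptions: binlog files are an in-memory list ordered oldest
      to newest, and events are hypothetical ("checkpoint", file_name) or
      ("xid", number) tuples rather than real binlog events.

```python
def xids_for_recovery(binlogs):
    """Collect the XIDs that recovery must look at.

    binlogs: list of (file_name, events), ordered oldest to newest.
    """
    names = [name for name, _ in binlogs]
    last_name, last_events = binlogs[-1]

    xids = set()
    checkpoint = None
    # Always scan the last binlog file first, remembering the last
    # checkpoint event seen in it.
    for kind, value in last_events:
        if kind == "checkpoint":
            checkpoint = value
        elif kind == "xid":
            xids.add(value)

    # No checkpoint event at all: the file was written by an old server,
    # so the last binlog is itself the checkpoint.
    if checkpoint is None or checkpoint == last_name:
        return xids

    # Otherwise go back to the checkpoint file and scan every file from
    # there up to (but not including) the last one, which is already done.
    start = names.index(checkpoint)  # raises ValueError if the file is gone
    for _, events in binlogs[start:-1]:
        for kind, value in events:
            if kind == "xid":
                xids.add(value)
    return xids


binlogs = [
    ("binlog.000001", [("checkpoint", "binlog.000001"), ("xid", 1)]),
    ("binlog.000002", [("checkpoint", "binlog.000001"), ("xid", 2),
                       ("checkpoint", "binlog.000002"), ("xid", 3)]),
    ("binlog.000003", [("checkpoint", "binlog.000002"), ("xid", 4)]),
]
needed = xids_for_recovery(binlogs)
```

      In this example the last checkpoint seen in binlog.000003 names
      binlog.000002, so two files are scanned and binlog.000001 is skipped.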

      The new binlog checkpoint event will be marked informational. This allows
      slaves that do not know the event (such as MySQL 5.6) to safely ignore the
      event without halting replication. Older slaves (prior to 5.6) do not support
      informational events - for those, we will use MDEV-225 to convert the binlog
      checkpoint event to a dummy event that will allow such older slaves to proceed
      without breaking replication.

      3. Determining when a new binlog checkpoint is reached

      Whenever a new binlog file is created, we will link it into a list maintained
      in-memory. In this list, we maintain a count of XIDs not yet fully committed
      in each binlog file, to replace the global counter used prior to this task.

      When we (group) commit one or more XIDs to the binlog, we increase the counter
      for the corresponding binlog file, and return an identifier for that binlog as
      the cookie from log_and_order(). Then in unlog() we subtract one from the same
      counter. When the counter reaches zero in unlog() for a binlog that is not the
      newest one actively being written, we write out a new binlog checkpoint event
      and remove the now fully flushed binlog file from the in-memory list.

      Note that if we somehow manage to flush all commits in binlog (N-1) before the
      last commit in binlog (N-2), we may need to remove two (or more) binlog file
      entries at a time and go directly from binlog checkpoint (N-2) to N.
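
      The per-file counters can be modelled with a small sketch. The class and
      attribute names are hypothetical, the file name itself serves as the cookie,
      and the checkpoint written at rotation is assumed to name the oldest file
      that still has pending XIDs; the real server code differs.

```python
class BinlogXidTracker:
    """Toy model of the in-memory list of binlog files and pending XIDs."""

    def __init__(self, first_file):
        self.files = [[first_file, 0]]  # oldest first: [name, pending XIDs]
        self.checkpoints = []           # checkpoint events "written" so far

    def _drop_flushed(self):
        # Remove fully flushed files from the head of the list, but never
        # the newest file, which is still being written.
        removed = False
        while len(self.files) > 1 and self.files[0][1] == 0:
            self.files.pop(0)
            removed = True
        return removed

    def new_file(self, name):
        self.files.append([name, 0])
        self._drop_flushed()
        # Every new binlog file starts with a checkpoint event.
        self.checkpoints.append(self.files[0][0])

    def log_and_order(self, n_xids=1):
        entry = self.files[-1]
        entry[1] += n_xids
        return entry[0]  # cookie identifying the binlog file

    def unlog(self, cookie):
        for entry in self.files:
            if entry[0] == cookie:
                entry[1] -= 1
                break
        else:
            raise KeyError(cookie)
        if self._drop_flushed():
            self.checkpoints.append(self.files[0][0])


t = BinlogXidTracker("binlog.000001")
c1 = t.log_and_order()           # XID pending in binlog.000001
t.new_file("binlog.000002")
c2 = t.log_and_order()           # XID pending in binlog.000002
t.new_file("binlog.000003")
t.unlog(c2)                      # 000002 is flushed first: no new checkpoint
t.unlog(c1)                      # 000001 flushed: both entries are removed
```

      The last two calls reproduce the out-of-order case: the checkpoint jumps
      directly from binlog.000001 to binlog.000003.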

      4. Binlog rotation

      With the above in place, we can now remove the extra locking needed when
      rotating the binlog, eliminating disadvantage (B).

      In MYSQL_BIN_LOG::new_file_impl(), we no longer need to wait for
      prepared_xids() to drop to zero - we can always rotate immediately without
      waiting.

      This way, we eliminate the stalls where the binlog rotate is waiting for XIDs
      to be unlogged while it is holding the expensive LOCK_log. This also removes
      the need to work around the potential deadlock of MySQL Bug#44058 when
      implementing MDEV-162.

      Binlog purge must be extended so it will not purge any log that still has
      outstanding XIDs (log_in_use()).
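
      As a rough illustration of the purge restriction, the sketch below stops
      purging at the first file that still has outstanding XIDs. The function and
      its arguments are hypothetical, and stopping rather than skipping is an
      assumption, chosen because recovery scans every file from the checkpoint
      onwards.

```python
def purgeable(names, pending, purge_before):
    """Return the binlog files that may be purged.

    names: binlog file names, oldest first.
    pending: dict mapping file name to its count of outstanding XIDs.
    purge_before: purge only files older than this one.
    """
    result = []
    for name in names:
        # Stop at the purge target or at the first file still in use.
        if name == purge_before or pending.get(name, 0) > 0:
            break
        result.append(name)
    return result


names = ["binlog.000001", "binlog.000002", "binlog.000003", "binlog.000004"]
pending = {"binlog.000002": 1}
result = purgeable(names, pending, "binlog.000004")
```

      Although the purge target is binlog.000004, only binlog.000001 is returned,
      because binlog.000002 still has an outstanding XID.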

      For eliminating disadvantage (A), extra fsync() for every storage engine
      commit, see MDEV-232.

      5. Testing

      To test, we will need to add some crash recovery testing - this is anyway
      rather lacking in the current test suite. We will test such things as crashing
      with latest binlog checkpoint at different positions - in latest binlog, or 1 or
      2 binlogs back (with small binlog size). And we will test with different
      numbers of commits in different stages - not prepared / prepared / written
      into binlog / committed-not-durable / committed. And we will test repeated
      recovery - when we crash again at different places while doing crash
      recovery. Such testing can be done by combining debug_sync (to ensure
      different commits are progressed to the desired stage) with dbug crash
      insertion.
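
      The test dimensions listed above can be enumerated as a matrix. This sketch
      only generates the combinations and the outcome one would expect from the
      recovery rule in the overview; the names are hypothetical and it does not
      drive debug_sync or dbug crash insertion.

```python
from itertools import product

# How many binlog files back the latest checkpoint is at crash time.
CHECKPOINT_AGE = [0, 1, 2]

# Stage a transaction had reached when the server crashed.
COMMIT_STAGES = [
    "not prepared",
    "prepared",
    "written into binlog",
    "committed-not-durable",
    "committed",
]


def expected_after_recovery(stage):
    # Assumption: a transaction survives recovery only if its XID had
    # reached the binlog before the crash.
    reached_binlog = COMMIT_STAGES.index("written into binlog")
    if COMMIT_STAGES.index(stage) >= reached_binlog:
        return "committed"
    return "rolled back"


cases = [
    (age, stage, expected_after_recovery(stage))
    for age, stage in product(CHECKPOINT_AGE, COMMIT_STAGES)
]
```

      Each tuple in cases describes one crash scenario; crashing again during
      recovery would add a further dimension to this matrix.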

          Activity

          Kristian Nielsen added a comment - pushed to 10.0-base

            People

            • Assignee: Kristian Nielsen
            • Reporter: Kristian Nielsen
            • Votes: 0
            • Watchers: 2

                Time Tracking

                Estimated: 1w 1d 2h
                Remaining: 2d 30m
                Logged: 2w 1d 5h