MariaDB Server / MDEV-4698

With GTID replication, relay logs cannot be relied upon while purging binary logs on master

    Details

    • Type: Bug
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 10.0.10
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      I know from the corresponding thread on the mailing list that it is an intentional change for the sake of crash-safety, so it is just a documentation request.

      With traditional (binlog-position-based) replication it is quite possible, and even reasonable, to set up a master binlog purging procedure based on the slave IO thread status: as soon as the IO thread is done with a master binary log and has switched to the next one, all events from that log are in the relay log, and the master binary log can be purged. This is efficient in the sense that if the slave SQL thread is far behind, a lot of disk space can be saved by not storing the same events both in the master binlog and in the relay log; even more so if the server features the sql_delay (master_delay) functionality introduced in MySQL 5.6, and the slave is configured to keep a time gap with the master.
      It also saves network traffic if the lagging slave gets restarted, because the local relay logs are preserved and the IO thread does not have to re-read all the events.

      So, all in all, I expect there are real-life configurations which rely on this behavior.

      Now, with GTID the relay logs are no longer preserved across a slave restart, so users must not configure their purge procedure this way and should base it on the SQL thread position instead. This needs to be explicitly documented, because otherwise users can experience irreversible loss of events.
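
      To illustrate the difference between the two purge policies, here is a minimal sketch in Python. It assumes the purge script has already collected one SHOW SLAVE STATUS row per slave as a dictionary, and that binlog files use the usual numeric suffix (for example mysql-bin.000097). The helper names and the sample file names are hypothetical; Master_Log_File (IO thread position) and Relay_Master_Log_File (SQL thread position) are the standard SHOW SLAVE STATUS columns.

```python
def binlog_index(name):
    """Return the numeric suffix of a binlog file name ('mysql-bin.000042' -> 42)."""
    return int(name.rsplit(".", 1)[1])


def safe_purge_target(slave_statuses, use_sql_thread=True):
    """Pick the oldest master binlog file that some slave still needs.

    slave_statuses: list of dicts, one per slave, holding SHOW SLAVE STATUS columns.
    use_sql_thread: True  -> use Relay_Master_Log_File (events already applied),
                    False -> use Master_Log_File (events merely fetched).
    Returns None when no slave status is available, so nothing gets purged.
    """
    field = "Relay_Master_Log_File" if use_sql_thread else "Master_Log_File"
    files = [status[field] for status in slave_statuses]
    if not files:
        return None
    return min(files, key=binlog_index)


def purge_statement(target):
    """Build the statement that removes all binlogs older than `target`."""
    return "PURGE BINARY LOGS TO '%s'" % target


# Hypothetical sample: both slaves have fetched up to file 120,
# but one of them has only applied events up to file 97.
slaves = [
    {"Master_Log_File": "mysql-bin.000120", "Relay_Master_Log_File": "mysql-bin.000097"},
    {"Master_Log_File": "mysql-bin.000120", "Relay_Master_Log_File": "mysql-bin.000118"},
]

io_based = safe_purge_target(slaves, use_sql_thread=False)   # old, IO-thread policy
sql_based = safe_purge_target(slaves, use_sql_thread=True)   # policy needed with GTID
```

      With the IO-thread policy the script would purge everything before mysql-bin.000120, which loses files 97 to 119 for good if the lagging slave restarts in GTID mode and discards its relay logs. The SQL-thread policy keeps everything from mysql-bin.000097 onwards.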


              Activity
              knielsen Kristian Nielsen added a comment -

              Monty thinks that we need to fix GTID so that, also in GTID mode, the
              slave can continue replication from its relay logs at slave start, instead
              of deleting them and re-fetching already fetched events from the master.

              The main challenge regarding this is that this must be 100% crash safe,
              also without enabling extra disk syncs on the relay logs or other files.
              Probably when starting up after crash, the slave needs to do binlog recovery
              of the relay log files.

              Another challenge is to correctly handle things like START SLAVE UNTIL
              and other logic which is now done on the master while sending binlog
              to the slave. Probably some of the error handling can be omitted (as
              any error will have been already thrown on the master before the events
              were sent to the slave).

              But one needs to consider what should happen if GTID strict mode was
              disabled when the events were fetched from the master, but was later
              enabled and the slave restarted on the relay logs: is it ok that strict
              mode errors are left unreported on events already fetched from the master?


                People

                • Assignee:
                  knielsen Kristian Nielsen
                  Reporter:
                  elenst Elena Stepanova