Details

    • Type: Task Task
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Fix Version/s: 10.0.4
    • Labels:
      None
    • Global Rank:
      26

      Description

      Global transaction ID

      The purpose of Global transaction ID (GTID) is to make slave position
      independent of internal details of master's binlog (file name, file
      offset). This allows a simpler switch of a slave to a new master, as the
      current slave position is valid on the new master as well as the old.

      The GTID is some kind of a tag, that is attached to every transaction. The
      point is — it must be globally unique, it must go into the binary log together
      with the transaction itself, it must be replicated to slaves, and it must be
      preserved when a transaction is applied on the slave. That is no matter how
      many slaves the transaction was replicated through (in a complex replication
      graph), on every slave and in all binary logs on slaves the transaction should
      have the same GTID.

      As part of this task, we will also make the slave replication state crash safe
      (can be recovered after a crash in a transactionally safe way).

      Gtid_log_event

      A global trasaction id is a pair (server_id, seq_no). The server_id already exists
      to identify events originating at different servers. The seq_no is new, it is
      a 64-bit unsigned integer that increases monotonically (not necessarily
      without holes) at each commit on the master.

      Every binlog event group (eg. transaction, DDL, non-transactional statement)
      is annotated with its global transaction id. Most event groups are already
      bracketed with BEGIN/COMMIT events. We replace the BEGIN event with a new
      Gtid_Log_Event event. This event contains the seq_no of the global transaction
      id for the following event group (the server_id is stored in the event header
      of every event). We also include some flag bits.

      On the slave, the Gtid_Log_Event is applied like the BEGIN query
      event. However, the seq_no from the event is remembered, and preserved when
      the replicated event is binlogged, just like server_id currently is. A session
      variable pseudo_seq_no is introduced (requires SUPER to change) to similarly
      override seq_no, and used by mysqlbinlog to preserve gtid for
      mysqlbinlog|mysql style binlog apply.

      For event groups that currently have no BEGIN/COMMIT around them, a
      Gtid_Log_event is added before it. This event has a flag set to let the slave
      know that no COMMIT will follow, so it just applies to the following event
      group. This is for eg. DDL and certain out-of-band events like TRUNCATE of
      HEAP table after master restart.

      Old MariaDB slaves (or MySQL slaves) will not understand the new
      Gtid_log_event. We handle this using the existing mechanism for replacing
      events that old slaves cannot handle. For Gtid_log_event with no corresponding
      COMMIT we can just omit it or replace with a dummy event as appropriate. For
      the case with a corresponding COMMIT, we replace the Gtid_log_event with a
      normal BEGIN query event.

      To be able to do this, we make the Gtid_log_event be 38 bytes (19 bytes header
      + 19 bytes body), reserving a couple unused bytes for future expansin, as this
      is the minimum size for a BEGIN query event. Incidentally the current BEGIN
      event takes 68 bytes due to redundant information, so we still get 30 bytes
      space saved for every event group by introducing GTID.

      GTID in binlog

      Whenever we write a new event group to the binlog, we allocate the next seq_no
      and put it in a Gtid_log_event at the start of the group. When we write
      through a cache, we put 0 there and back-patch the proper seq_no when we write
      the cache to the binlog, so that seq_no ordering follows commit order and
      binlog order (per server-id).

      When the server shuts down, we write (and fsync()) the current seq_no to a
      file gtid.info in the data directory. When we startup, we read it back in to
      resume from the correct number. We fsync() the gtid.info file before marking
      the last binlog file as cleanly closed. And if at startup we detect that last
      binlog was not cleanly closed, we do the normal binlog crash recovery, and as
      part of that extract the last used seq_no from the events seen there, instead
      of relying on the (probably corrupt) gtid.info.

      The server remembers (eg. in a hash) the last seq_no seen for every
      server_id. When a new binlog file is written, this list of GTIDs is written
      out at the start of the binlog as a new Gtid_list_log_event.

      This allows to find the location in the server binlogs of any given GTID (server_id,
      seq_no): start from the last binlog file, and scan backwards. For each binlog
      file, read the Gtid_list_log_event at the start. If server_id is found with a
      lower-or-equal seq_no, then the GTID is found in this file, and we can scan
      forward until we find it. If server_id is found with a bigger seq_no, then the GTID
      exists in an earlier binlog file (or has been purged if this is true for every
      binlog file). If server_id is not found in the Gtid_list_log_event, then the GTID
      was never seen by this server.

      Slave replication state

      By the slave replication state, we denote what currently is the master binlog
      file name and file offset of the last event group applied on a slave. This is
      needed when reconnecting a slave to a master to resume replication at the
      correct point in the binlog event stream and not lose or duplicate any events.

      When GTIDs are used, the slave replication state is instead the GTID of the
      last event group applied on the slave. However, because of multi-master
      replication (and later perhaps parallel replication that re-orders events),
      the replication state becomes a set of GTIDs last applied.

      The slave remembers the GTID with the maximal seq_no for every server_id value
      of every event group applied. If an event is received with lower-or-equal
      seq_no than already applied for that server_id, then the event is ignored
      (same as currently when receiving an event with own server_id). This avoids
      duplicating events in circular replication topologies.

      The slave replication state (list of GTIDs) is written to gtid.info at
      shutdown, and also written at the start of every slave binlog file. This makes
      it possible to recover after a slave crash, during the scan of the last
      binlog, same way as for seq_no on a master. We can in addition store it in
      relay-log.info if --log-slave-updates=OFF, so slaves can run with binlog
      disabled, however this will be as crash-unsafe as current replication.

      When the slave connects to the master with GTID enabled, it no longer needs to
      send master binlog file name and file offset. Instead it sends its replication
      state as a list of GTIDs. The master will search back through its binlogs as
      described above for the earliest of these GTIDs, and start sending events from
      that point on back to the slave. Note that if the server_id of a GTID has
      never been seen by a master, it can be ignored; however if it is seen in some
      Gtid_list_log_event but has been purged, then it is an error and slave connect
      fails (same as currently if slave asks for a binlog file that has been
      purged).

      When a slave connects using the old style (binlog file name and file offset),
      the master will send back the replication state corresponding to this
      position. This allows to automatically migrate to GTID; the next time the
      slave reconnects, it can use the replication state obtained from previous
      connect. It also allows to provision a new slave from a backup made with
      mysqldump --master-data or XtraDB. These provide the old-style binlog file
      name and positions to use — and after first connect to the master, the slave
      can automatically switch over to use global transaction IDs.

      Note that the list of GTIDs kept on master and on slave is in fact identical,
      even though it is used for different purposes on master and slave. Of course,
      it is possible for a server to be both a master and a slave, and use the
      information for both purposes.

      User interface

      The SHOW SLAVE STATUS command needs to be extended to also show the
      GTID-enabled replication state, which is the set of GTIDs with maximal seq_no
      per server-id applied on the slave. The SHOW MASTER STATUS should also show
      it.

      The CHANGE MASTER TO command can be used as before. If a GTID replication
      state is available, and no explicit filename/offset is given, and master
      supports GTID, then GTID will be used to automatically start from the correct
      event group.

      CHANGE MASTER should also be extended to allow to specify the current
      replication state. This should not normally be needed, but can be useful to
      experiment or recover from fatal server loss or corruption, etc. Specifying a
      GTID sets the maximal seq_no for the given server id. Specifying NULL for
      set_no in the GTID removes the server_id from the GTID list, leaving things as
      if that server_id was never seen before on the server.

      The slave_skip_counter can be used as before. It still records the GTID of any
      event group skipped.

      START SLAVE UNTIL is extended in syntax to take a GTID. It stops the slave
      when a GTID with same server_id and greater-or-equal seq_no is reached (if
      equal, the event is applied on the slave before stopping). If given a list of
      GTIDs, stops when any of them is reached. (Stopping when all GTIDs in a list
      have been reached can be achieved by a sequence of START SLAVE UNTIL
      commands).

      Switching to a new slave

      Suppose we have a number of slaves replicating off of one master (or several
      masters with multi-source replication). GTIDs make it simpler to switch to
      using one of the slaves as a new master (because the original master died or
      is taken down for maintenance or whatever).

      In the simple case where there is no multi-source replication (and no parallel
      replication that reorders transactions, if that is later implemented), then
      the event stream is completely linear, and with the same sequence in every
      slave at any point in the replication hierarchy. Then if one runs SHOW SLAVE
      STATUS on each slave, there will be one slave that has greater-or-equal seq_no
      for each server_id than any other slave.

      Now we can simply promote that slave (there may be several equal to choose
      from) as the new master. All the other slaves can simply do CHANGE MASTER TO,
      specifying the connection details of the new master, and GTID will ensure that
      they continue at the correct position.

      Of course if the master can be stopped gracefully while switching slave, we
      can just let all slaves run until they have all replicated everything from the
      master. Then any slave can be promoted as the new master.

      If using multi-source (or possibly later implemented parallel replication),
      then matters can be more complex (but remember that this is not the common
      case — only multi-source replication, where such switch of master is probably
      uncommon, and any later implemented parallel replication).

      Let us say that for two GTIDs we have

      (sid1,seq_no1) <= (sid2,seq_no2) iff sid1==sid2 && seq_no1 <= seq_no2

      It is possible that if the old master disappears abruptly (ie. crashes), then
      for every pair of slaves S1, S2, the replication state of S1 and S2 may be
      in-commensurable: There is a GTID1 on S1 with no greater GTID on S2, and
      likewise a GTID2 on S2 with no greater GITD on S1.

      In this case no slave is immediately ready to take over as master without some
      event groups getting missing. However, we can still do a reliable master
      switch, as follows:

      First, arbitrarily pick any slave as the new master. Obtain the replication
      state (SHOW SLAVE STATUS) of all other slaves. The idea is that we will
      replicate all missing changes from every other slave to the newly selected
      master to make sure it has everything needed to fullfill the master role.

      For each slave S, we do a CHANGE MASTER TO on the server selected as new
      master. Then we START SLAVE UNTIL <gtid> for every gtid in the replication
      state of S. This ensures that we have every event group seen by S on the new
      master. By repeating for every slave server, we end up with a new master
      server that will have a replication state that is a superset of all the
      remaining slaves. We can then simply CHANGE MASTER TO on all the slaves, and
      continue.

      As part of this worklog, I will write a script that reads a list of connection
      strings to the set of slave servers, and goes through the above procedure,
      resulting in the first server on the list being promoted as the new master and
      every other slave on the list changed to replicate from the new master.

      Comparison with MySQL 5.6 global transaction ID

      The main motivation for this design is my dislike for the MySQL 5.6 global
      transaction ID design.

      Despite all of its flaws (mainly lack of robustness), MySQL replication has
      been extremely successful. I believe the reason is that is transparent, in the
      sense that the way it works is conceptually simple to understand, and thus
      possible to tweak and manipulate by users. Replication consists simply of
      sending a stream of the changes done on a master to the slaves to be
      repeated. The slave replication state is simply the position in the stream.

      The design explained here preserves this conceptual simplicity. The
      replication state is still just the position in the stream. We just have a
      universal way to refer to that position that works on all servers across the
      replication topology. It is still possible to tweak and manipuate that
      position (slave_skip_counter, START SLAVE UNTIL, etc.)

      Multi-source replication makes the state more complex, since we have now a set
      of positions, but it is still conceptually sane.

      The MySQL 5.6 design, in contrast, loses the simple concept of replication
      state. The state is now an abstract set of all GTIDs ever applied on a slave,
      which is harder to grasp and manipulate. The concept of position in a
      replication stream becomes meaningless, as parallel replication can
      arbitrarily re-order events in the binlogs at different levels of the
      replication topology.

      (See below for plans on how to extend the design presented here to handle
      MySQL-style parallel replication in a way that tries to preserve the nice
      properties).

      Comparison with the Google transaction ID patch

      I think this approach is rather similar in concept to the Google transaction
      ID patch. The main differences are probably:

      • The use of the binlog for persisting the replication state in a crash-safe
        manner. I believe this is a better approach than what the Google patch
        does.
      • The extensions to handle multi-source replication, which is necessary as
        this feature has been introduced to MariaDB.

      Future expansion for parallel replication

      Parallel replication is not a part of this design. This task can be fully
      implemted as described here in a self-consistent way. However, parallel
      replication is an important feature, and this section describes how to plan
      for being able to extend GTIDs later in a nice way to handle parallel
      replication.

      Now, parallel replication can come in two variants. One is where transactions
      are run in parallel, but still committed in the same order on the slave as on
      the master. This in-order parallelism has no conflicts with the GTID design
      described here, and can be implemented independently. MWL#184 is an example of
      this.

      The other out-of-order variant is what MySQL 5.6 does. Transactions are
      committed on a slave in different order than on the master, so the slave
      binlog has transactions in a different order than the master. But this kind of
      parallelism can potentially obtain higher degree on the slave, so can be
      desirable.

      Multi-source replication is related to out-of-order parallelism. On a
      multi-source slave, event groups are applied in parallel, but written
      interleaved with one another in some arbitrary way in a single
      binlog. So a multi-source slave S1 could itself be a master for a deeper-down
      slave S2, but currently events on S2 would have to be replicated in-order with
      no parallelism, which can be prohibitively expensive.

      With multi-source we have extra information available on S1: We know that
      events from two different upstream masters have no predefined ordering, while
      events from a single upstream master must be applied in the order given. If we
      record that information in the binlog of S1, then S2 can use that information
      to know that it is safe for it to also apply events from distinct uptream
      masters in parallel, but keep the ordering of events belonging to a single
      upstream master.

      If we do it this way, then we retain the nice conceptual understanding of a
      GTID as a well-defined position in a replication stream. If two GTIDs
      originate from the same upstream master, they have a well-defined ordering
      which will be preserved all across the replication topology.

      In effect we now have multiple replication streams in each binlog. Between
      different replication streams, there is no ordering implied, however within a
      single replication stream GTIDs uniquely and consistently define a simple
      linear order.

      We can extend this by allowing multiple user-defined streams originating at a
      single server, with the application having the responsibility of ensuring that
      different streams are really independent. For example, we could create N
      streams based on a hash of the used database, to get different databases
      replicated in parallel just as the MySQL 5.6 MTS (multi-threaded slave)
      feature. But we still retain the concept of GTID as position in a linear
      stream — just with multiple streams possible.

      The full design of this will be written up in a different task. But we can
      prepare the global transaction ID design to be more easily extensible towards
      multiple binlog streams, without much extra effort.

      Whenever we store a global transaction ID (server_id, seq_no), we also store a
      replication stream ID. So we store this in Gtid_log_event and
      Gtid_list_log_event, write it in gtid.info (and relay-log.info), and remember
      it as part of the slave replication state. We can use a 32-bit unsigned
      integer as replication stream id.

      For the implemetation of GTID, this will always be zero. Later, we can
      implement that multi-source replication can assign different replication
      stream id to events from different upstream masters. The idea is that two
      events groups with distinct replication stream ID can be replicated in
      parallel and committed in any order to the binlog. This will allow a
      downstream slave S2 to also parallelise events from different upstream
      masters.

      Similarly, by allowing applications to annotate different transactions with
      different stream IDs, we can achieve the same kind of parallel replication as
      MySQL 5.6, in a way that is both simpler and more flexible.

        Issue Links

          Activity

          Hide
          Kristian Nielsen added a comment -

          We can simplify the design somewhat if we put some extra requirements on the
          use of multi-source replication with global transaction ID.

          We might require that if global transaction ID will be used with multi-source,
          then each of the multiple masters must configure a different stream_id. This
          way, it gets explicitly declared in the binlogs that the streams are
          replicated independently.

          Then there is no longer a need for the replication state to remember the
          last seq_no applied for every server_id ever seen. Instead, we only need to
          remember seq_no for each stream_id seen.

          This makes the replication stream much simpler, especially when multi-source
          replication is not used.

          For example, in the common setup with just one master and (N-1) slaves, then
          the replication state is just a single global transaction ID, and the binlogs
          on all the servers always have transactions in the save order. This makes it
          conceptually much easier to understand where a slave is replicating currently,
          and to provision manually a new slave.

          Show
          Kristian Nielsen added a comment - We can simplify the design somewhat if we put some extra requirements on the use of multi-source replication with global transaction ID. We might require that if global transaction ID will be used with multi-source, then each of the multiple masters must configure a different stream_id. This way, it gets explicitly declared in the binlogs that the streams are replicated independently. Then there is no longer a need for the replication state to remember the last seq_no applied for every server_id ever seen. Instead, we only need to remember seq_no for each stream_id seen. This makes the replication stream much simpler, especially when multi-source replication is not used. For example, in the common setup with just one master and (N-1) slaves, then the replication state is just a single global transaction ID, and the binlogs on all the servers always have transactions in the save order. This makes it conceptually much easier to understand where a slave is replicating currently, and to provision manually a new slave.

            People

            • Assignee:
              Kristian Nielsen
              Reporter:
              Rasmus Johansson
            • Votes:
              2 Vote for this issue
              Watchers:
              8 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - 0 minutes
                0m
                Remaining:
                Time Spent - 6 weeks, 2 days, 7 hours, 15 minutes Remaining Estimate - 5 weeks, 4 days, 4 hours, 45 minutes
                5w 4d 4h 45m
                Logged:
                Time Spent - 6 weeks, 2 days, 7 hours, 15 minutes Remaining Estimate - 5 weeks, 4 days, 4 hours, 45 minutes
                6w 2d 7h 15m