We're updating the issue view to help you get more done. 

Global transaction ID

Description

Global transaction ID

The purpose of Global transaction ID (GTID) is to make slave position
independent of internal details of master's binlog (file name, file
offset). This allows a simpler switch of a slave to a new master, as the
current slave position is valid on the new master as well as the old.

The GTID is some kind of a tag, that is attached to every transaction. The
point is — it must be globally unique, it must go into the binary log together
with the transaction itself, it must be replicated to slaves, and it must be
preserved when a transaction is applied on the slave. That is no matter how
many slaves the transaction was replicated through (in a complex replication
graph), on every slave and in all binary logs on slaves the transaction should
have the same GTID.

As part of this task, we will also make the slave replication state crash safe
(can be recovered after a crash in a transactionally safe way).

Gtid_log_event

A global trasaction id is a pair (server_id, seq_no). The server_id already exists
to identify events originating at different servers. The seq_no is new, it is
a 64-bit unsigned integer that increases monotonically (not necessarily
without holes) at each commit on the master.

Every binlog event group (eg. transaction, DDL, non-transactional statement)
is annotated with its global transaction id. Most event groups are already
bracketed with BEGIN/COMMIT events. We replace the BEGIN event with a new
Gtid_Log_Event event. This event contains the seq_no of the global transaction
id for the following event group (the server_id is stored in the event header
of every event). We also include some flag bits.

On the slave, the Gtid_Log_Event is applied like the BEGIN query
event. However, the seq_no from the event is remembered, and preserved when
the replicated event is binlogged, just like server_id currently is. A session
variable pseudo_seq_no is introduced (requires SUPER to change) to similarly
override seq_no, and used by mysqlbinlog to preserve gtid for
mysqlbinlog|mysql style binlog apply.

For event groups that currently have no BEGIN/COMMIT around them, a
Gtid_Log_event is added before it. This event has a flag set to let the slave
know that no COMMIT will follow, so it just applies to the following event
group. This is for eg. DDL and certain out-of-band events like TRUNCATE of
HEAP table after master restart.

Old MariaDB slaves (or MySQL slaves) will not understand the new
Gtid_log_event. We handle this using the existing mechanism for replacing
events that old slaves cannot handle. For Gtid_log_event with no corresponding
COMMIT we can just omit it or replace with a dummy event as appropriate. For
the case with a corresponding COMMIT, we replace the Gtid_log_event with a
normal BEGIN query event.

To be able to do this, we make the Gtid_log_event be 38 bytes (19 bytes header
+ 19 bytes body), reserving a couple unused bytes for future expansin, as this
is the minimum size for a BEGIN query event. Incidentally the current BEGIN
event takes 68 bytes due to redundant information, so we still get 30 bytes
space saved for every event group by introducing GTID.

GTID in binlog

Whenever we write a new event group to the binlog, we allocate the next seq_no
and put it in a Gtid_log_event at the start of the group. When we write
through a cache, we put 0 there and back-patch the proper seq_no when we write
the cache to the binlog, so that seq_no ordering follows commit order and
binlog order (per server-id).

When the server shuts down, we write (and fsync()) the current seq_no to a
file gtid.info in the data directory. When we startup, we read it back in to
resume from the correct number. We fsync() the gtid.info file before marking
the last binlog file as cleanly closed. And if at startup we detect that last
binlog was not cleanly closed, we do the normal binlog crash recovery, and as
part of that extract the last used seq_no from the events seen there, instead
of relying on the (probably corrupt) gtid.info.

The server remembers (eg. in a hash) the last seq_no seen for every
server_id. When a new binlog file is written, this list of GTIDs is written
out at the start of the binlog as a new Gtid_list_log_event.

This allows to find the location in the server binlogs of any given GTID (server_id,
seq_no): start from the last binlog file, and scan backwards. For each binlog
file, read the Gtid_list_log_event at the start. If server_id is found with a
lower-or-equal seq_no, then the GTID is found in this file, and we can scan
forward until we find it. If server_id is found with a bigger seq_no, then the GTID
exists in an earlier binlog file (or has been purged if this is true for every
binlog file). If server_id is not found in the Gtid_list_log_event, then the GTID
was never seen by this server.

Slave replication state

By the slave replication state, we denote what currently is the master binlog
file name and file offset of the last event group applied on a slave. This is
needed when reconnecting a slave to a master to resume replication at the
correct point in the binlog event stream and not lose or duplicate any events.

When GTIDs are used, the slave replication state is instead the GTID of the
last event group applied on the slave. However, because of multi-master
replication (and later perhaps parallel replication that re-orders events),
the replication state becomes a set of GTIDs last applied.

The slave remembers the GTID with the maximal seq_no for every server_id value
of every event group applied. If an event is received with lower-or-equal
seq_no than already applied for that server_id, then the event is ignored
(same as currently when receiving an event with own server_id). This avoids
duplicating events in circular replication topologies.

The slave replication state (list of GTIDs) is written to gtid.info at
shutdown, and also written at the start of every slave binlog file. This makes
it possible to recover after a slave crash, during the scan of the last
binlog, same way as for seq_no on a master. We can in addition store it in
relay-log.info if --log-slave-updates=OFF, so slaves can run with binlog
disabled, however this will be as crash-unsafe as current replication.

When the slave connects to the master with GTID enabled, it no longer needs to
send master binlog file name and file offset. Instead it sends its replication
state as a list of GTIDs. The master will search back through its binlogs as
described above for the earliest of these GTIDs, and start sending events from
that point on back to the slave. Note that if the server_id of a GTID has
never been seen by a master, it can be ignored; however if it is seen in some
Gtid_list_log_event but has been purged, then it is an error and slave connect
fails (same as currently if slave asks for a binlog file that has been
purged).

When a slave connects using the old style (binlog file name and file offset),
the master will send back the replication state corresponding to this
position. This allows to automatically migrate to GTID; the next time the
slave reconnects, it can use the replication state obtained from previous
connect. It also allows to provision a new slave from a backup made with
mysqldump --master-data or XtraDB. These provide the old-style binlog file
name and positions to use — and after first connect to the master, the slave
can automatically switch over to use global transaction IDs.

Note that the list of GTIDs kept on master and on slave is in fact identical,
even though it is used for different purposes on master and slave. Of course,
it is possible for a server to be both a master and a slave, and use the
information for both purposes.

User interface

The SHOW SLAVE STATUS command needs to be extended to also show the
GTID-enabled replication state, which is the set of GTIDs with maximal seq_no
per server-id applied on the slave. The SHOW MASTER STATUS should also show
it.

The CHANGE MASTER TO command can be used as before. If a GTID replication
state is available, and no explicit filename/offset is given, and master
supports GTID, then GTID will be used to automatically start from the correct
event group.

CHANGE MASTER should also be extended to allow to specify the current
replication state. This should not normally be needed, but can be useful to
experiment or recover from fatal server loss or corruption, etc. Specifying a
GTID sets the maximal seq_no for the given server id. Specifying NULL for
set_no in the GTID removes the server_id from the GTID list, leaving things as
if that server_id was never seen before on the server.

The slave_skip_counter can be used as before. It still records the GTID of any
event group skipped.

START SLAVE UNTIL is extended in syntax to take a GTID. It stops the slave
when a GTID with same server_id and greater-or-equal seq_no is reached (if
equal, the event is applied on the slave before stopping). If given a list of
GTIDs, stops when any of them is reached. (Stopping when all GTIDs in a list
have been reached can be achieved by a sequence of START SLAVE UNTIL
commands).

Switching to a new slave

Suppose we have a number of slaves replicating off of one master (or several
masters with multi-source replication). GTIDs make it simpler to switch to
using one of the slaves as a new master (because the original master died or
is taken down for maintenance or whatever).

In the simple case where there is no multi-source replication (and no parallel
replication that reorders transactions, if that is later implemented), then
the event stream is completely linear, and with the same sequence in every
slave at any point in the replication hierarchy. Then if one runs SHOW SLAVE
STATUS on each slave, there will be one slave that has greater-or-equal seq_no
for each server_id than any other slave.

Now we can simply promote that slave (there may be several equal to choose
from) as the new master. All the other slaves can simply do CHANGE MASTER TO,
specifying the connection details of the new master, and GTID will ensure that
they continue at the correct position.

Of course if the master can be stopped gracefully while switching slave, we
can just let all slaves run until they have all replicated everything from the
master. Then any slave can be promoted as the new master.

If using multi-source (or possibly later implemented parallel replication),
then matters can be more complex (but remember that this is not the common
case — only multi-source replication, where such switch of master is probably
uncommon, and any later implemented parallel replication).

Let us say that for two GTIDs we have

(sid1,seq_no1) <= (sid2,seq_no2) iff sid1==sid2 && seq_no1 <= seq_no2

It is possible that if the old master disappears abruptly (ie. crashes), then
for every pair of slaves S1, S2, the replication state of S1 and S2 may be
in-commensurable: There is a GTID1 on S1 with no greater GTID on S2, and
likewise a GTID2 on S2 with no greater GITD on S1.

In this case no slave is immediately ready to take over as master without some
event groups getting missing. However, we can still do a reliable master
switch, as follows:

First, arbitrarily pick any slave as the new master. Obtain the replication
state (SHOW SLAVE STATUS) of all other slaves. The idea is that we will
replicate all missing changes from every other slave to the newly selected
master to make sure it has everything needed to fullfill the master role.

For each slave S, we do a CHANGE MASTER TO on the server selected as new
master. Then we START SLAVE UNTIL <gtid> for every gtid in the replication
state of S. This ensures that we have every event group seen by S on the new
master. By repeating for every slave server, we end up with a new master
server that will have a replication state that is a superset of all the
remaining slaves. We can then simply CHANGE MASTER TO on all the slaves, and
continue.

As part of this worklog, I will write a script that reads a list of connection
strings to the set of slave servers, and goes through the above procedure,
resulting in the first server on the list being promoted as the new master and
every other slave on the list changed to replicate from the new master.

Comparison with MySQL 5.6 global transaction ID

The main motivation for this design is my dislike for the MySQL 5.6 global
transaction ID design.

Despite all of its flaws (mainly lack of robustness), MySQL replication has
been extremely successful. I believe the reason is that is transparent, in the
sense that the way it works is conceptually simple to understand, and thus
possible to tweak and manipulate by users. Replication consists simply of
sending a stream of the changes done on a master to the slaves to be
repeated. The slave replication state is simply the position in the stream.

The design explained here preserves this conceptual simplicity. The
replication state is still just the position in the stream. We just have a
universal way to refer to that position that works on all servers across the
replication topology. It is still possible to tweak and manipuate that
position (slave_skip_counter, START SLAVE UNTIL, etc.)

Multi-source replication makes the state more complex, since we have now a set
of positions, but it is still conceptually sane.

The MySQL 5.6 design, in contrast, loses the simple concept of replication
state. The state is now an abstract set of all GTIDs ever applied on a slave,
which is harder to grasp and manipulate. The concept of position in a
replication stream becomes meaningless, as parallel replication can
arbitrarily re-order events in the binlogs at different levels of the
replication topology.

(See below for plans on how to extend the design presented here to handle
MySQL-style parallel replication in a way that tries to preserve the nice
properties).

Comparison with the Google transaction ID patch

I think this approach is rather similar in concept to the Google transaction
ID patch. The main differences are probably:

  • The use of the binlog for persisting the replication state in a crash-safe
    manner. I believe this is a better approach than what the Google patch
    does.

  • The extensions to handle multi-source replication, which is necessary as
    this feature has been introduced to MariaDB.

Future expansion for parallel replication

Parallel replication is not a part of this design. This task can be fully
implemted as described here in a self-consistent way. However, parallel
replication is an important feature, and this section describes how to plan
for being able to extend GTIDs later in a nice way to handle parallel
replication.

Now, parallel replication can come in two variants. One is where transactions
are run in parallel, but still committed in the same order on the slave as on
the master. This in-order parallelism has no conflicts with the GTID design
described here, and can be implemented independently. MWL#184 is an example of
this.

The other out-of-order variant is what MySQL 5.6 does. Transactions are
committed on a slave in different order than on the master, so the slave
binlog has transactions in a different order than the master. But this kind of
parallelism can potentially obtain higher degree on the slave, so can be
desirable.

Multi-source replication is related to out-of-order parallelism. On a
multi-source slave, event groups are applied in parallel, but written
interleaved with one another in some arbitrary way in a single
binlog. So a multi-source slave S1 could itself be a master for a deeper-down
slave S2, but currently events on S2 would have to be replicated in-order with
no parallelism, which can be prohibitively expensive.

With multi-source we have extra information available on S1: We know that
events from two different upstream masters have no predefined ordering, while
events from a single upstream master must be applied in the order given. If we
record that information in the binlog of S1, then S2 can use that information
to know that it is safe for it to also apply events from distinct uptream
masters in parallel, but keep the ordering of events belonging to a single
upstream master.

If we do it this way, then we retain the nice conceptual understanding of a
GTID as a well-defined position in a replication stream. If two GTIDs
originate from the same upstream master, they have a well-defined ordering
which will be preserved all across the replication topology.

In effect we now have multiple replication streams in each binlog. Between
different replication streams, there is no ordering implied, however within a
single replication stream GTIDs uniquely and consistently define a simple
linear order.

We can extend this by allowing multiple user-defined streams originating at a
single server, with the application having the responsibility of ensuring that
different streams are really independent. For example, we could create N
streams based on a hash of the used database, to get different databases
replicated in parallel just as the MySQL 5.6 MTS (multi-threaded slave)
feature. But we still retain the concept of GTID as position in a linear
stream — just with multiple streams possible.

The full design of this will be written up in a different task. But we can
prepare the global transaction ID design to be more easily extensible towards
multiple binlog streams, without much extra effort.

Whenever we store a global transaction ID (server_id, seq_no), we also store a
replication stream ID. So we store this in Gtid_log_event and
Gtid_list_log_event, write it in gtid.info (and relay-log.info), and remember
it as part of the slave replication state. We can use a 32-bit unsigned
integer as replication stream id.

For the implemetation of GTID, this will always be zero. Later, we can
implement that multi-source replication can assign different replication
stream id to events from different upstream masters. The idea is that two
events groups with distinct replication stream ID can be replicated in
parallel and committed in any order to the binlog. This will allow a
downstream slave S2 to also parallelise events from different upstream
masters.

Similarly, by allowing applications to annotate different transactions with
different stream IDs, we can achieve the same kind of parallel replication as
MySQL 5.6, in a way that is both simpler and more flexible.

Environment

None

Status

Assignee

Kristian Nielsen

Reporter

Rasmus Johansson

Time estimate

0m

Fix versions

Priority

Major