Details
Type: Bug
Status: Closed
Priority: Minor
Resolution: Fixed
Affects Version/s: 10.0.2
Fix Version/s: 10.0.3
Component/s: None
Labels: None
Description
I set replication 1->2 to use GTID, start it, execute some events on server 1 and server 2, then set replication 2->1 to use GTID too, and attempt to start it.
In the example below, it fails with "Table 'test.t2' doesn't exist". Apparently it misses an event on startup, although the event is present in the binary log.
Output of the test case provided below
#
# For now we'll only have 1->2 running
#
# Server 1
# Stop replication 2->1
include/stop_slave.inc
#
# Server 2
# Use GTID for replication 1->2
include/stop_slave.inc
change master to master_use_gtid=1;
include/start_slave.inc
#
# Create some 0-1-* and 0-2-* events in binlog of server 2
connection server_1;
create table t1 (i int) engine=InnoDB;
insert into t1 values (1);
connection server_2;
create table t2 (i int) engine=InnoDB;
connection server_1;
insert into t1 values (2);
connection server_2;
insert into t2 values (1);
#
# All events are present in the binlog of server 2
show binlog events;
Log_name          Pos   Event_type         Server_id  End_log_pos  Info
slave-bin.000001  4     Format_desc        2          248          Server ver: 10.0.1-MariaDB-debug-log, Binlog ver: 4
slave-bin.000001  248   Gtid_list          2          271          []
slave-bin.000001  271   Binlog_checkpoint  2          310          slave-bin.000001
slave-bin.000001  310   Gtid               1          348          GTID 0-1-1
slave-bin.000001  348   Query              1          453          use `test`; create table t1 (i int) engine=InnoDB
slave-bin.000001  453   Gtid               1          491          BEGIN GTID 0-1-2
slave-bin.000001  491   Query              1          584          use `test`; insert into t1 values (1)
slave-bin.000001  584   Xid                1          611          COMMIT /* xid=277 */
slave-bin.000001  611   Gtid               2          649          GTID 0-2-3
slave-bin.000001  649   Query              2          754          use `test`; create table t2 (i int) engine=InnoDB
slave-bin.000001  754   Gtid               1          792          BEGIN GTID 0-1-3
slave-bin.000001  792   Query              1          885          use `test`; insert into t1 values (2)
slave-bin.000001  885   Xid                1          912          COMMIT /* xid=282 */
slave-bin.000001  912   Gtid               2          950          BEGIN GTID 0-2-4
slave-bin.000001  950   Query              2          1043         use `test`; insert into t2 values (1)
slave-bin.000001  1043  Xid                2          1070         COMMIT /* xid=283 */
#
# Server 1
# Start replication 2->1 using GTID,
# it fails with 'Table 'test.t2' doesn't exist'
# (which shows up either as a failure on sync_with_master,
# or more often as hanging start_slave.inc)
change master to master_use_gtid=1;
include/start_slave.inc
MariaDB [test]> show slave status \G
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 127.0.0.1
Master_User: root
Master_Port: 16001
Connect_Retry: 1
Master_Log_File: slave-bin.000001
Read_Master_Log_Pos: 1070
Relay_Log_File: master-relay-bin.000002
Relay_Log_Pos: 597
Relay_Master_Log_File: slave-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1146
Last_Error: Error 'Table 'test.t2' doesn't exist' on query. Default database: 'test'. Query: 'insert into t2 values (1)'
Skip_Counter: 0
Exec_Master_Log_Pos: 310
Relay_Log_Space: 1053
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1146
Last_SQL_Error: Error 'Table 'test.t2' doesn't exist' on query. Default database: 'test'. Query: 'insert into t2 values (1)'
Replicate_Ignore_Server_Ids:
Master_Server_Id: 2
Using_Gtid: 1
1 row in set (0.00 sec)
Test case:
--source include/have_innodb.inc
--let $rpl_topology=1->2->1
--source include/rpl_init.inc
--echo #
--echo # For now we'll only have 1->2 running
--echo #
--echo # Server 1
--echo # Stop replication 2->1
--connection server_1
--source include/stop_slave.inc
--echo #
--echo # Server 2
--echo # Use GTID for replication 1->2
--connection server_2
--source include/stop_slave.inc
change master to master_use_gtid=1;
--source include/start_slave.inc
--echo #
--echo # Create some 0-1-* and 0-2-* events in binlog of server 2
--enable_connect_log
--connection server_1
create table t1 (i int) engine=InnoDB;
insert into t1 values (1);
--save_master_pos
--connection server_2
--sync_with_master
create table t2 (i int) engine=InnoDB;
--save_master_pos
--connection server_1
insert into t1 values (2);
--save_master_pos
--connection server_2
--sync_with_master
insert into t2 values (1);
--save_master_pos
--disable_connect_log
--echo #
--echo # All events are present in the binlog of server 2
show binlog events;
--echo #
--echo # Server 1
--echo # Start replication 2->1 using GTID,
--echo # it fails with 'Table 'test.t2' doesn't exist'
--echo # (which shows up either as a failure on sync_with_master,
--echo # or more often as hanging start_slave.inc)
--connection server_1
change master to master_use_gtid=1;
--source include/start_slave.inc
--sync_with_master
--source include/rpl_end.inc
cnf file:
!include suite/rpl/rpl_1slave_base.cnf
!include include/default_client.cnf

[mysqld.1]
log-slave-updates
loose-innodb

[mysqld.2]
log-slave-updates
loose-innodb
bzr version-info
revision-id: knielsen@knielsen-hq.org-20130503092729-gedp152b19k5wdnj
revno: 3626
branch-nick: 10.0-base
Issue Links
relates to: MDEV-26 Global transaction ID (Closed)
Activity
Thanks for testing this!
You are in uncharted territory: I did consider circular topologies in the
design, but I have not tested them yet.
There is one problem with your test. You have two masters active at the same
time. Doing this with GTID requires configuring different gtid_domain_id for
the two masters.
It does not help that you stopped the direction 2->1. What matters is that you
have two masters (whether their slave is running or not at the precise
moment), and you are doing updates on one without first replicating all
changes from the other.
Concretely, we have
S2: create table t2 ...
S1: insert into t1 ...
On S2, "create table t2" will be binlogged before "insert into t1". But on S1,
"insert into t1" is binlogged first.
So when S1 connects with GTID as slave to S2, it asks to start from the
"insert into t1" which is the latest GTID it applied in domain 0. But this is
after "create table t2" in binlog on S2, so that event is lost.
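The lost event can be reproduced with a small model of the lookup described
above. This is only a sketch under simplifying assumptions, not the server's
actual code: Event and events_to_send are hypothetical names, the master is
assumed to scan its binlog for the slave's last applied GTID in each domain and
send everything after it, and a domain the slave has never seen is assumed to
be sent from the beginning. The GTIDs are the ones from the "show binlog
events" output in the description.

```python
from typing import NamedTuple


class Event(NamedTuple):
    domain: int
    server: int
    seq: int
    query: str


def events_to_send(master_binlog, slave_pos):
    """Return the events a master would send to a connecting slave.

    slave_pos maps domain -> (server_id, seq_no) of the last GTID the slave
    applied in that domain. Simplified model, not the real server logic.
    """
    found = set()  # domains whose start position has been located
    to_send = []
    for ev in master_binlog:
        pos = slave_pos.get(ev.domain)
        if pos is None or ev.domain in found:
            to_send.append(ev)
        elif (ev.server, ev.seq) == pos:
            found.add(ev.domain)  # start sending after this event
    return to_send


# Binlog of S2 as shown in the description (everything in domain 0).
s2_binlog = [
    Event(0, 1, 1, "create table t1"),
    Event(0, 1, 2, "insert into t1 values (1)"),
    Event(0, 2, 3, "create table t2"),
    Event(0, 1, 3, "insert into t1 values (2)"),
    Event(0, 2, 4, "insert into t2 values (1)"),
]

# S1 last applied its own 0-1-3 in domain 0.
sent = events_to_send(s2_binlog, {0: (1, 3)})
print([ev.query for ev in sent])
```

In this model only "insert into t2 values (1)" is sent. "create table t2"
(0-2-3) sits before 0-1-3 in the S2 binlog and is skipped, which is consistent
with the 1146 error in the slave status above.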
There are two ways to do this correctly:
1. Either configure different domain_ids for S1 and S2.
2. Or alternatively, use a single domain and make sure that everything is
replicated S2->S1 before doing changes on S1, and vice versa. Basically, let
both slave directions run at the same time, and --sync-with-master each time
before doing updates on the next server.
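Option (1) can be illustrated with the same kind of toy model. Again, this is
a sketch under assumptions rather than the real implementation: the domain ids
1 (for S1) and 2 (for S2) are made-up example values for gtid_domain_id,
sequence numbers are assumed to be counted per domain, events_to_send is a
hypothetical helper, and a domain the slave has no position for is assumed to
be sent from the start.

```python
def events_to_send(master_binlog, slave_pos):
    """Toy model: per domain, send everything after the slave's last GTID.

    master_binlog is a list of (domain, server_id, seq_no, query) tuples.
    slave_pos maps domain -> (server_id, seq_no). Not the real server logic.
    """
    found = set()  # domains whose start position has been located
    to_send = []
    for domain, server, seq, query in master_binlog:
        pos = slave_pos.get(domain)
        if pos is None or domain in found:
            to_send.append(query)
        elif (server, seq) == pos:
            found.add(domain)
    return to_send


# Same statements as in the test case, but S1 logs in domain 1
# and S2 logs in domain 2 (hypothetical gtid_domain_id values).
s2_binlog = [
    (1, 1, 1, "create table t1"),
    (1, 1, 2, "insert into t1 values (1)"),
    (2, 2, 1, "create table t2"),
    (1, 1, 3, "insert into t1 values (2)"),
    (2, 2, 2, "insert into t2 values (1)"),
]

# S1 has applied 1-1-3 in its own domain and nothing in domain 2 yet.
sent = events_to_send(s2_binlog, {1: (1, 3)})
print(sent)
```

Because each domain has its own position, the interleaving of the two
masters' events in the S2 binlog no longer matters in this model: both
"create table t2" and "insert into t2 values (1)" are sent.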
So I would like to handle this in two stages.
First, we should make sure that (1) and (2) actually work correctly (they
should, but it definitely needs testing, there are likely bugs).
Second, while what the test does is fundamentally incorrect and cannot work,
it would still be best if we could give a better user experience and report a
clear error message rather than silently dropping events.
"First" we should do now. "Second" I would like to revisit later, when the
basic stuff has been better tested and is solid. A multimaster ring is a
somewhat advanced concept, and it is reasonable to expect more knowledge
from the DBA who sets that up. My vague idea is that we could implement a GTID
"strict" mode that would detect the wrong configuration and give an error
immediately. The detection would be to see that S2 gets an event from S1 with
the same sequence number that it already logged itself to its own binlog. Such
a strict mode probably cannot be on by default though, as then a simple upgrade
to 10.0 would break ring setups, even if users have no plans to use GTID.
What do you think?
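The detection idea for such a "strict" mode could look roughly like the sketch
below. It only illustrates the check as described in this comment and makes no
claim about how the server does or will implement it: GtidStrictError,
check_incoming and is_rejected are hypothetical names, and the binlog is
modelled as a plain list of (domain, server_id, seq_no) tuples.

```python
class GtidStrictError(Exception):
    """Raised when an incoming GTID clashes with one already in the binlog."""


def check_incoming(own_binlog, incoming):
    """Reject an incoming GTID whose (domain, seq_no) was already logged
    locally by a different server. own_binlog and incoming use
    (domain, server_id, seq_no) tuples. Hypothetical sketch only.
    """
    domain, server, seq = incoming
    for d, s, n in own_binlog:
        if d == domain and n == seq and s != server:
            raise GtidStrictError(
                f"GTID {domain}-{server}-{seq} conflicts with "
                f"already logged {d}-{s}-{n}"
            )


def is_rejected(own_binlog, incoming):
    """Convenience wrapper: True if the strict check would raise."""
    try:
        check_incoming(own_binlog, incoming)
    except GtidStrictError:
        return True
    return False


# S2 in the test case: it has logged 0-1-1, 0-1-2 and its own 0-2-3,
# then receives 0-1-3 from S1 (same domain, same sequence number).
single_domain = [(0, 1, 1), (0, 1, 2), (0, 2, 3)]
print(is_rejected(single_domain, (0, 1, 3)))

# With separate (made-up) domains 1 and 2 there is no clash.
two_domains = [(1, 1, 1), (1, 1, 2), (2, 2, 1)]
print(is_rejected(two_domains, (1, 1, 3)))
```

In this sketch the single-domain setup from the test case is rejected as soon
as S2 receives 0-1-3, while the setup with different domain ids passes.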