Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-7432

STOP SLAVE in the middle of event group can cause rollback of transaction mixing transactional and non-transactional DML, causing slave corruption

    Details

    • Type: Bug
    • Status: Stalled
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.1, 10.0
    • Fix Version/s: 10.1, 10.0
    • Component/s: Tests
    • Labels:

      Description

      http://buildbot.askmonty.org/buildbot/builders/work-amd64-valgrind/builds/6554/steps/test/logs/stdio

      rpl.rpl_parallel 'innodb_plugin,row'     w3 [ fail ]
              Test ended at 2015-01-06 23:50:40
      
      CURRENT_TEST: rpl.rpl_parallel
      mysqltest: In included file "./include/wait_for_slave_param.inc": 
      included from ./include/wait_for_slave_sql_to_start.inc at line 32:
      included from ./include/wait_for_slave_to_start.inc at line 27:
      included from ./include/start_slave.inc at line 35:
      included from /var/lib/buildbot/maria-slave/work-opensuse-amd64/build/mysql-test/suite/rpl/t/rpl_parallel.test at line 344:
      At line 115: Timeout in include/wait_for_slave_param.inc
      
      The result from queries just before the failure was:
      < snip >
      'group_commit_waiting_for_prior SIGNAL slave_queued3',
      ''))
      master-bin.000002	1417	Xid	1	1444	COMMIT /* xid=358 */
      master-bin.000002	1444	Gtid	1	1482	BEGIN GTID 0-1-15
      master-bin.000002	1482	Query	1	1571	use `test`; INSERT INTO t2 VALUES (20)
      master-bin.000002	1571	Query	1	1660	use `test`; INSERT INTO t1 VALUES (20)
      master-bin.000002	1660	Query	1	1749	use `test`; INSERT INTO t2 VALUES (21)
      master-bin.000002	1749	Query	1	1842	use `test`; INSERT INTO t3 VALUES (20, 20)
      master-bin.000002	1842	Xid	1	1869	COMMIT /* xid=421 */
      master-bin.000002	1869	Gtid	1	1907	BEGIN GTID 0-1-16
      master-bin.000002	1907	Query	1	1999	use `test`; INSERT INTO t3 VALUES(21, 21)
      master-bin.000002	1999	Xid	1	2026	COMMIT /* xid=426 */
      master-bin.000002	2026	Gtid	1	2064	BEGIN GTID 0-1-17
      master-bin.000002	2064	Query	1	2156	use `test`; INSERT INTO t3 VALUES(22, 22)
      master-bin.000002	2156	Xid	1	2183	COMMIT /* xid=427 */
      
      **** SHOW RELAYLOG EVENTS on server_1 ****
      relaylog_name = 'No such row'
      SHOW RELAYLOG EVENTS IN 'No such row';
      Log_name	Pos	Event_type	Server_id	End_log_pos	Info
      
      More results from queries before failure can be found in /dev/shm/var_auto_lwsQ/3/log/rpl_parallel.log
      
       - saving '/dev/shm/var_auto_lwsQ/3/log/rpl.rpl_parallel-innodb_plugin,row/' to '/var/lib/buildbot/maria-slave/work-opensuse-amd64/build/mysql-test/var/log/rpl.rpl_parallel-innodb_plugin,row/'
      
      Retrying test rpl.rpl_parallel, attempt(2/3)...
      
      worker[3] > Restart  - not started
      worker[3] > Restart  - not started
      ***Warnings generated in error logs during shutdown after running tests: rpl.rpl_gtid_startpos rpl.rpl_commit_after_flush rpl.rpl_gtid_errorhandling rpl.rpl_foreign_key_innodb rpl.rpl_old_master rpl.rpl_bug26395 rpl.rpl_gtid_mdev4484 rpl.rpl_failed_optimize rpl.rpl_dual_pos_advance rpl.rpl_gtid_ignored rpl.rpl_insert_id_pk rpl.rpl_gtid_reconnect rpl.rpl_insert_ignore rpl.rpl_insert_id rpl.rpl_parallel rpl.rpl_multi_engine rpl.rpl_gtid_until rpl.rpl_mysql_upgrade rpl.rpl_checksum_cache
      
      150106 23:44:13 [Warning] Unsafe statement written to the binary log using statement format since BINLOG_FORMAT = STATEMENT. Statement is unsafe because it accesses a non-transactional table after accessing a transactional table within the same transaction. Statement: INSERT INTO t1 VALUES (20)
      150106 23:44:15 [Warning] Slave: Duplicate entry '20' for key 'PRIMARY' Error_code: 1062
      

        Gliffy Diagrams

          Attachments

            Issue Links

              Activity

              Hide
              knielsen Kristian Nielsen added a comment -

              Failure can be reproduced by adding this sleep in the code:

              === modified file 'sql/rpl_parallel.cc'
              --- sql/rpl_parallel.cc	2015-01-07 13:45:39 +0000
              +++ sql/rpl_parallel.cc	2015-01-14 07:22:39 +0000
              @@ -2107,6 +2107,7 @@ rpl_parallel::do_event(rpl_group_info *s
                 unlock_or_exit_cond(rli->sql_driver_thd, &cur_thread->LOCK_rpl_thread,
                                     &did_enter_cond, &old_stage);
                 mysql_cond_signal(&cur_thread->COND_rpl_thread);
              +my_sleep(100000);
               
                 return 0;
               }
              
              
              Show
              knielsen Kristian Nielsen added a comment - Failure can be reproduced by adding this sleep in the code: === modified file 'sql/rpl_parallel.cc' --- sql/rpl_parallel.cc 2015-01-07 13:45:39 +0000 +++ sql/rpl_parallel.cc 2015-01-14 07:22:39 +0000 @@ -2107,6 +2107,7 @@ rpl_parallel::do_event(rpl_group_info *s unlock_or_exit_cond(rli->sql_driver_thd, &cur_thread->LOCK_rpl_thread, &did_enter_cond, &old_stage); mysql_cond_signal(&cur_thread->COND_rpl_thread); +my_sleep(100000); return 0; }
              Hide
              knielsen Kristian Nielsen added a comment -

              So this looks like a real bug in the code.

              When STOP SLAVE is run, a flag is set.

              When a worker thread sees the stop flag set, it checks if it has any more
              events queued up for it. If not, it force rolls back any current transaction.
              Since after the stop flag is set, no more events will be queued for worker
              threads.

              However, if the transaction (event group, really) contains a mix of
              transactional and non-transactional DML, then this will corrupt the slave
              state. Because the transactional part will be rolled back, but the
              non-transactional part will remain.

              So this means that STOP SLAVE is currently not safe for parallel replication
              if there are event groups containing multiple non-transactional DML events.

              To fix, I think a timeout mechanism needs to be implemented in the SQL driver
              thread, similar to the non-parallel case. At STOP SLAVE, continue queuing
              events into the relay log and into worker threads until either the current
              event group has been completely queued, or until the stop timeout has elapsed.

              This is however complicated by the fact that in the SQL driver thread, we
              cannot really know if an event group contains non-transactional events. In the
              non-parallel case, we know this, and can apply the timeout only if
              non-transactional events are present (making rollback in the middle of the
              event group unsafe).

              In the parallel case, seems we would need to apply the timeout always, or in
              10.1 we could look at the "transactional" flag in the GTID event (but then we
              will still be unsafe if replicating from InnoDB on the master to MyISAM on the
              slave, something however that in general is not really safe to do for parallel
              replication).

              Show
              knielsen Kristian Nielsen added a comment - So this looks like a real bug in the code. When STOP SLAVE is run, a flag is set. When a worker thread sees the stop flag set, it checks if it has any more events queued up for it. If not, it force rolls back any current transaction. Since after the stop flag is set, no more events will be queued for worker threads. However, if the transaction (event group, really) contains a mix of transactional and non-transactional DML, then this will corrupt the slave state. Because the transactional part will be rolled back, but the non-transactional part will remain. So this means that STOP SLAVE is currently not safe for parallel replication if there are event groups containing multiple non-transactional DML events. To fix, I think a timeout mechanism needs to be implemented in the SQL driver thread, similar to the non-parallel case. At STOP SLAVE, continue queuing events into the relay log and into worker threads until either the current event group has been completely queued, or until the stop timeout has elapsed. This is however complicated by the fact that in the SQL driver thread, we cannot really know if an event group contains non-transactional events. In the non-parallel case, we know this, and can apply the timeout only if non-transactional events are present (making rollback in the middle of the event group unsafe). In the parallel case, seems we would need to apply the timeout always, or in 10.1 we could look at the "transactional" flag in the GTID event (but then we will still be unsafe if replicating from InnoDB on the master to MyISAM on the slave, something however that in general is not really safe to do for parallel replication).

                People

                • Assignee:
                  knielsen Kristian Nielsen
                  Reporter:
                  elenst Elena Stepanova
                • Votes:
                  0 Vote for this issue
                  Watchers:
                  1 Start watching this issue

                  Dates

                  • Created:
                    Updated: