Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-7818

NEED REVIEW: Deadlock occurring with parallel replication and FTWRL

    Details

      Description

      Thread 5 is locked by global read lock but FTWRL never completes and stays in waiting for commit lock; replication thread stays locked forever until FTWRL is killed

      MariaDB(db-02)[(none)]> show processlist\G
      *************************** 1. row ***************************
            Id: 3
          User: system user
          Host: 
            db: NULL
       Command: Connect
          Time: 12612
         State: Waiting for prior transaction to start commit before starting next transaction
          Info: NULL
      Progress: 0.000
      *************************** 2. row ***************************
            Id: 4
          User: system user
          Host: 
            db: NULL
       Command: Connect
          Time: 12612
         State: Waiting for prior transaction to commit
          Info: NULL
      Progress: 0.000
      *************************** 3. row ***************************
            Id: 5
          User: system user
          Host: 
            db: crypto_data
       Command: Connect
          Time: 12612
         State: Waiting for global read lock
          Info: INSERT INTO `market_orders_ok` (`label`, `marketid`, `ordertype`, `price`, `quantity`, `total`) VALU
      Progress: 0.000
      *************************** 4. row ***************************
            Id: 6
          User: system user
          Host: 
            db: NULL
       Command: Connect
          Time: 12612
         State: Waiting for prior transaction to start commit before starting next transaction
          Info: NULL
      Progress: 0.000
      *************************** 5. row ***************************
            Id: 7
          User: system user
          Host: 
            db: NULL
       Command: Connect
          Time: 1651381
         State: Waiting for master to send event
          Info: NULL
      Progress: 0.000
      *************************** 6. row ***************************
            Id: 8
          User: system user
          Host: 
            db: NULL
       Command: Connect
          Time: 13263
         State: Waiting for room in worker thread event queue
          Info: NULL
      Progress: 0.000
      *************************** 7. row ***************************
            Id: 39
          User: monitoring
          Host: localhost
            db: NULL
       Command: Sleep
          Time: 59
         State: 
          Info: NULL
      Progress: 0.000
      *************************** 8. row ***************************
            Id: 1568378
          User: tanje6acfc0c1213cfb63
          Host: localhost
            db: NULL
       Command: Sleep
          Time: 3
         State: 
          Info: NULL
      Progress: 0.000
      *************************** 9. row ***************************
            Id: 27871929
          User: backup
          Host: localhost
            db: NULL
       Command: Query
          Time: 12612
         State: Waiting for commit lock
          Info: FLUSH TABLES WITH READ LOCK
      Progress: 0.000
      *************************** 10. row ***************************
            Id: 28082858
          User: tanje6acfc0c1213cfb63
          Host: localhost
            db: NULL
       Command: Query
          Time: 0
         State: init
          Info: show processlist
      Progress: 0.000
      10 rows in set (0.00 sec)
      

        Gliffy Diagrams

          Attachments

            Issue Links

              Activity

              Hide
              elenst Elena Stepanova added a comment -

              I am able to reproduce it with this (very crude and not very reliable) test:

              
              --source include/master-slave.inc
              --source include/have_binlog_format_statement.inc
              --source include/have_innodb.inc
              
              create table t1 (i int) engine=InnoDB;
              create table t2 (i int) engine=InnoDB;
              create table t3 (i int) engine=InnoDB;
              
              --sync_slave_with_master
              install soname 'metadata_lock_info';
              
              set global binlog_commit_wait_count=2, binlog_commit_wait_usec=2000000;
              --source include/stop_slave.inc
              set global slave_parallel_threads=4;
              
              --connection master
              set global binlog_commit_wait_count=2, binlog_commit_wait_usec=1000000;
              begin;
              insert into t3 values (1),(2),(3),(4);
              insert into t3 select * from t3 /* 1 */;
              insert into t3 select * from t3 /* 2 */;
              insert into t3 select * from t3 /* 3 */;
              insert into t3 select * from t3 /* 4 */;
              insert into t3 select * from t3 /* 5 */;
              insert into t3 select * from t3 /* 6 */;
              insert into t3 select * from t3 /* 7 */;
              insert into t3 select * from t3 /* 8 */;
              insert into t3 select * from t3 /* 9 */;
              insert into t3 select * from t3 /* 10 */;
              insert into t3 select * from t3 /* 11 */;
              insert into t3 select * from t3 /* 12 */;
              insert into t1 values (1),(2),(3),(4);
              insert into t1 select * from t1;
              --send commit
              
              --connection master1
              insert into t2 values (1),(2);
              
              --connection master
              --reap
              
              --connection slave
              --source include/start_slave.inc
              --sleep 1
              --send FLUSH TABLES WITH READ LOCK
              
              --connection slave1
              --let $show_statement= SHOW PROCESSLIST
              --let $field= State
              --let $condition= 'Waiting for commit lock'
              --source include/wait_show_condition.inc
              
              SHOW PROCESSLIST;
              select * from information_schema.metadata_lock_info;
              
              --connection slave
              --reap
              UNLOCK TABLES;
              
              Id	User	Host	db	Command	Time	State	Info	Progress
              4	root	localhost:37845	test	Sleep	20		NULL	0.000
              5	root	localhost:37846	test	Sleep	21		NULL	0.000
              8	root	localhost:37852	test	Query	0	Waiting for global read lock	FLUSH TABLES WITH READ LOCK	0.000
              9	root	localhost:37853	test	Query	0	init	SHOW PROCESSLIST	0.000
              10	system user		NULL	Connect	1	Waiting for master to send event	NULL	0.000
              11	system user		NULL	Connect	1	Waiting for work from SQL thread	NULL	0.000
              12	system user		NULL	Connect	1	Waiting for prior transaction to commit	NULL	0.000
              13	system user		NULL	Connect	1	Waiting for work from SQL thread	NULL	0.000
              14	system user		test	Connect	17	Copying to tmp table	insert into t3 select * from t3 /* 9 */	0.000
              15	system user		NULL	Connect	1	Slave has read all relay log; waiting for the slave I/O thread to update it	NULL	0.000
              
              select * from information_schema.metadata_lock_info;
              THREAD_ID	LOCK_MODE	LOCK_DURATION	LOCK_TYPE	TABLE_SCHEMA	TABLE_NAME
              14	MDL_INTENTION_EXCLUSIVE	MDL_STATEMENT	Global read lock		
              12	MDL_INTENTION_EXCLUSIVE	MDL_EXPLICIT	Commit lock		
              12	MDL_SHARED_WRITE	MDL_TRANSACTION	Table metadata lock	mysql	gtid_slave_pos
              14	MDL_SHARED_WRITE	MDL_TRANSACTION	Table metadata lock	test	t3
              12	MDL_SHARED_WRITE	MDL_TRANSACTION	Table metadata lock	test	t2
              
              Show
              elenst Elena Stepanova added a comment - I am able to reproduce it with this (very crude and not very reliable) test: --source include/master-slave.inc --source include/have_binlog_format_statement.inc --source include/have_innodb.inc create table t1 (i int) engine=InnoDB; create table t2 (i int) engine=InnoDB; create table t3 (i int) engine=InnoDB; --sync_slave_with_master install soname 'metadata_lock_info'; set global binlog_commit_wait_count=2, binlog_commit_wait_usec=2000000; --source include/stop_slave.inc set global slave_parallel_threads=4; --connection master set global binlog_commit_wait_count=2, binlog_commit_wait_usec=1000000; begin; insert into t3 values (1),(2),(3),(4); insert into t3 select * from t3 /* 1 */; insert into t3 select * from t3 /* 2 */; insert into t3 select * from t3 /* 3 */; insert into t3 select * from t3 /* 4 */; insert into t3 select * from t3 /* 5 */; insert into t3 select * from t3 /* 6 */; insert into t3 select * from t3 /* 7 */; insert into t3 select * from t3 /* 8 */; insert into t3 select * from t3 /* 9 */; insert into t3 select * from t3 /* 10 */; insert into t3 select * from t3 /* 11 */; insert into t3 select * from t3 /* 12 */; insert into t1 values (1),(2),(3),(4); insert into t1 select * from t1; --send commit --connection master1 insert into t2 values (1),(2); --connection master --reap --connection slave --source include/start_slave.inc --sleep 1 --send FLUSH TABLES WITH READ LOCK --connection slave1 --let $show_statement= SHOW PROCESSLIST --let $field= State --let $condition= 'Waiting for commit lock' --source include/wait_show_condition.inc SHOW PROCESSLIST; select * from information_schema.metadata_lock_info; --connection slave --reap UNLOCK TABLES; Id User Host db Command Time State Info Progress 4 root localhost:37845 test Sleep 20 NULL 0.000 5 root localhost:37846 test Sleep 21 NULL 0.000 8 root localhost:37852 test Query 0 Waiting for global read lock FLUSH TABLES WITH READ LOCK 0.000 9 root localhost:37853 test Query 0 init SHOW PROCESSLIST 0.000 10 system user NULL Connect 1 Waiting for master to send event NULL 0.000 11 system user NULL Connect 1 Waiting for work from SQL thread NULL 0.000 12 system user NULL Connect 1 Waiting for prior transaction to commit NULL 0.000 13 system user NULL Connect 1 Waiting for work from SQL thread NULL 0.000 14 system user test Connect 17 Copying to tmp table insert into t3 select * from t3 /* 9 */ 0.000 15 system user NULL Connect 1 Slave has read all relay log; waiting for the slave I/O thread to update it NULL 0.000 select * from information_schema.metadata_lock_info; THREAD_ID LOCK_MODE LOCK_DURATION LOCK_TYPE TABLE_SCHEMA TABLE_NAME 14 MDL_INTENTION_EXCLUSIVE MDL_STATEMENT Global read lock 12 MDL_INTENTION_EXCLUSIVE MDL_EXPLICIT Commit lock 12 MDL_SHARED_WRITE MDL_TRANSACTION Table metadata lock mysql gtid_slave_pos 14 MDL_SHARED_WRITE MDL_TRANSACTION Table metadata lock test t3 12 MDL_SHARED_WRITE MDL_TRANSACTION Table metadata lock test t2
              Hide
              knielsen Kristian Nielsen added a comment -

              Ok, so I think this is what happens.

              We have transactions T1 followed by T2.

              T2 gets ready to commit. It acquires the commit metadata lock in
              ha_commit_trans().

              Then in the binlog T2 needs to wait for T1 to commit first.

              FLUSH TABLES WITH READ LOCK runs, it needs to wait for the commit lock held by
              T2.

              T1 runs its INSERT statement, it gets blocked by the running FLUSH TABLES WITH
              READ LOCK.

              At this point, we have a deadlock.

              So the question is what the correct behaviour should be here.

              It is not safe to abort the commit of T2 with an error (and possibly retry it
              later). Because T2 may be non-transactional and not safe to roll back.

              One option seems to be to allow T1 to run despite the pending FTWRL. Because
              arguably, T1 already owns the commit lock together with T2. This would require
              that all places that a query (like INSERT) waits for FTWRL, the wait can be
              skipped in the parallel replication case, and it also requires some way to
              ensure that a later T3 will be blocked by FTWRL. I am not sure how to handle
              these points at the moment.

              Another option might be to let T2 release the commit lock when it goes to wait
              for a prior commit. This would be in queue_for_group_commit() (and probably
              ha_commit_one_phase() also needs similar attention?). And then re-aquire the
              lock after the wait. Or preferably I guess would be to not aquire the commit
              lock at all until after the wait for prior commit. (Group commit might then
              only have to aquire the lock once, in the leader). I am not at the moment sure
              what is required exactly to make this correct, needs more knowledge of the
              usage of the commit metadata lock...

              Show
              knielsen Kristian Nielsen added a comment - Ok, so I think this is what happens. We have transactions T1 followed by T2. T2 gets ready to commit. It acquires the commit metadata lock in ha_commit_trans(). Then in the binlog T2 needs to wait for T1 to commit first. FLUSH TABLES WITH READ LOCK runs, it needs to wait for the commit lock held by T2. T1 runs its INSERT statement, it gets blocked by the running FLUSH TABLES WITH READ LOCK. At this point, we have a deadlock. So the question is what the correct behaviour should be here. It is not safe to abort the commit of T2 with an error (and possibly retry it later). Because T2 may be non-transactional and not safe to roll back. One option seems to be to allow T1 to run despite the pending FTWRL. Because arguably, T1 already owns the commit lock together with T2. This would require that all places that a query (like INSERT) waits for FTWRL, the wait can be skipped in the parallel replication case, and it also requires some way to ensure that a later T3 will be blocked by FTWRL. I am not sure how to handle these points at the moment. Another option might be to let T2 release the commit lock when it goes to wait for a prior commit. This would be in queue_for_group_commit() (and probably ha_commit_one_phase() also needs similar attention?). And then re-aquire the lock after the wait. Or preferably I guess would be to not aquire the commit lock at all until after the wait for prior commit. (Group commit might then only have to aquire the lock once, in the leader). I am not at the moment sure what is required exactly to make this correct, needs more knowledge of the usage of the commit metadata lock...
              Hide
              knielsen Kristian Nielsen added a comment -

              A simpler, but less convenient, solution could be to forbid running FLUSH TABLES WITH READ LOCK while a parallel replication slave is running. The user would then need to STOP SLAVE; FLUSH TABLES WITH READ LOCK; START SLAVE. The net effect would be much the same, as the slave cannot do any work while FLUSH TABLES WITH READ LOCK is running anyway.

              Show
              knielsen Kristian Nielsen added a comment - A simpler, but less convenient, solution could be to forbid running FLUSH TABLES WITH READ LOCK while a parallel replication slave is running. The user would then need to STOP SLAVE; FLUSH TABLES WITH READ LOCK; START SLAVE. The net effect would be much the same, as the slave cannot do any work while FLUSH TABLES WITH READ LOCK is running anyway.
              Hide
              tanj Guillaume Lefranc added a comment - - edited

              I am concerned by the fact that this last solution would break most backup utilities when parallel replication is enabled.

              If MariaDB would port backup locks over from Percona then that would not be a problem anymore

              Show
              tanj Guillaume Lefranc added a comment - - edited I am concerned by the fact that this last solution would break most backup utilities when parallel replication is enabled. If MariaDB would port backup locks over from Percona then that would not be a problem anymore
              Hide
              knielsen Kristian Nielsen added a comment -

              Agree, hopefully I can come up with a fix that "just works" and does not require any special user action.

              Show
              knielsen Kristian Nielsen added a comment - Agree, hopefully I can come up with a fix that "just works" and does not require any special user action.
              Hide
              knielsen Kristian Nielsen added a comment -

              Current idea:

              When FTWRL starts, it first checks all parallel replication worker
              threads. It finds the most recent GTID started by any of them. It then sets
              a flag to tell the threads not to start on any newer GTIDs, and then waits
              for all earlier GTIDs to fully commit. It also sets a flag to tell START
              SLAVE, STOP SLAVE, and the SQL thread to not start any new slave activity.

              Once all worker threads have reached their designated point, FTWLR continues
              to take the global read lock. Once that is obtained, it clears the flags and
              signals worker threads and other slave code that it can proceed. At this
              point, the lock is held, so no real activity will be possible until the lock
              is cleared with UNLOCK TABLES.

              This should hopefully fix the deadlock, at least I got the test case of
              Elena to pass with a preliminary patch along these lines.

              Some care will probably be needed to guard against other deadlocks against
              concurrent START SLAVE / STOP SLAVE, hopefully I can get that solved.

              Show
              knielsen Kristian Nielsen added a comment - Current idea: When FTWRL starts, it first checks all parallel replication worker threads. It finds the most recent GTID started by any of them. It then sets a flag to tell the threads not to start on any newer GTIDs, and then waits for all earlier GTIDs to fully commit. It also sets a flag to tell START SLAVE, STOP SLAVE, and the SQL thread to not start any new slave activity. Once all worker threads have reached their designated point, FTWLR continues to take the global read lock. Once that is obtained, it clears the flags and signals worker threads and other slave code that it can proceed. At this point, the lock is held, so no real activity will be possible until the lock is cleared with UNLOCK TABLES. This should hopefully fix the deadlock, at least I got the test case of Elena to pass with a preliminary patch along these lines. Some care will probably be needed to guard against other deadlocks against concurrent START SLAVE / STOP SLAVE, hopefully I can get that solved.
              Hide
              knielsen Kristian Nielsen added a comment - - edited

              I have a fix for this now (that just removes the deadlock, without breaking
              existing backup tools or otherwise changing user-visible behaviour).

              It is in three parts. The actual fix is the second patch. The first patch is
              just unrelated refactoring of the parallel replication code to make the
              second patch cleaner. The third patch is a test case.

              http://lists.askmonty.org/pipermail/commits/2015-June/008059.html
              http://lists.askmonty.org/pipermail/commits/2015-June/008060.html
              http://lists.askmonty.org/pipermail/commits/2015-June/008061.html

              Elena: You asked some days ago - this patch should be ready, if you want to
              apply it in further testing.

              EDIT: Updated with newest version of patches after fixing MDEV-8318.

              Show
              knielsen Kristian Nielsen added a comment - - edited I have a fix for this now (that just removes the deadlock, without breaking existing backup tools or otherwise changing user-visible behaviour). It is in three parts. The actual fix is the second patch. The first patch is just unrelated refactoring of the parallel replication code to make the second patch cleaner. The third patch is a test case. http://lists.askmonty.org/pipermail/commits/2015-June/008059.html http://lists.askmonty.org/pipermail/commits/2015-June/008060.html http://lists.askmonty.org/pipermail/commits/2015-June/008061.html Elena: You asked some days ago - this patch should be ready, if you want to apply it in further testing. EDIT: Updated with newest version of patches after fixing MDEV-8318 .
              Hide
              elenst Elena Stepanova added a comment -

              See MDEV-8318. I made it a separate report not to pollute this one too much.

              Show
              elenst Elena Stepanova added a comment - See MDEV-8318 . I made it a separate report not to pollute this one too much.

                People

                • Assignee:
                  knielsen Kristian Nielsen
                  Reporter:
                  tanj Guillaume Lefranc
                • Votes:
                  0 Vote for this issue
                  Watchers:
                  3 Start watching this issue

                  Dates

                  • Created:
                    Updated: