Details

    • Type: Bug
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 10.0.20, 10.1.6
    • Fix Version/s: 10.1, 10.0
    • Component/s: Replication, Tests
    • Labels:
      None
    • Sprint:
      10.1.6-2, 10.0.21

      Description

      rpl.rpl_000011 'row' w1 [ fail ]
      Test ended at 2015-06-10 17:26:28

      CURRENT_TEST: rpl.rpl_000011
      — /usr/share/mysql/mysql-test/suite/rpl/r/rpl_000011.result 2015-06-10 13:40:30.000000000 +0300
      +++ /run/shm/var/1/log/rpl_000011.reject 2015-06-10 17:26:28.339504149 +0300
      @@ -8,7 +8,7 @@
      stop slave;
      show global status like 'com_insert';
      Variable_name Value
      -Com_insert 1
      +Com_insert 2
      include/wait_for_slave_to_stop.inc
      start slave;
      include/wait_for_slave_to_start.inc

      mysqltest: Result content mismatch

      Fails sporadically, but on many hosts.

        Gliffy Diagrams

          Attachments

            Activity

            Hide
            knielsen Kristian Nielsen added a comment -

            So this turns out to actually be an old bug.
            It just happened to trigger more easily after patch for MDEV-8294.

            The race is that there is a small window where SHOW GLOBAL STATUS may count
            twice the values of a thread that is exiting. This happens also for normal
            threads, not just the slave thread.

            Here is a test case for latest 10.0, triggers with the following sleep
            patched in:

            diff --git a/sql/mysqld.cc b/sql/mysqld.cc
            index e05c0b6..75af8f7 100644
            --- a/sql/mysqld.cc
            +++ b/sql/mysqld.cc
            @@ -2759,6 +2759,7 @@ void unlink_thd(THD *thd)
             
               thd->add_status_to_global();
             
            +my_sleep(2000000);
               mysql_mutex_lock(&LOCK_thread_count);
               thd->unlink();
               /*
            
            CREATE TABLE t1 (a INT PRIMARY KEY);
            
            connect (c1,localhost,root,,);
            INSERT INTO t1 VALUES (1);
            INSERT INTO t1 VALUES (2);
            INSERT INTO t1 VALUES (3);
            --let $cid= `SELECT CONNECTION_ID()`
            
            --connection default
            eval KILL CONNECTION $cid;
            --sleep 1
            SHOW GLOBAL STATUS LIKE 'Com_insert';
            --sleep 2
            SHOW GLOBAL STATUS LIKE 'Com_insert';
            
            DROP TABLE t1;
            

            The first SHOW GLOBAL STATUS counts the thread twice:

            CREATE TABLE t1 (a INT PRIMARY KEY);
            INSERT INTO t1 VALUES (1);
            INSERT INTO t1 VALUES (2);
            INSERT INTO t1 VALUES (3);
            KILL CONNECTION 4;
            SHOW GLOBAL STATUS LIKE 'Com_insert';
            Variable_name	Value
            Com_insert	6
            SHOW GLOBAL STATUS LIKE 'Com_insert';
            Variable_name	Value
            Com_insert	3
            DROP TABLE t1;
            
            Show
            knielsen Kristian Nielsen added a comment - So this turns out to actually be an old bug. It just happened to trigger more easily after patch for MDEV-8294 . The race is that there is a small window where SHOW GLOBAL STATUS may count twice the values of a thread that is exiting. This happens also for normal threads, not just the slave thread. Here is a test case for latest 10.0, triggers with the following sleep patched in: diff --git a/sql/mysqld.cc b/sql/mysqld.cc index e05c0b6..75af8f7 100644 --- a/sql/mysqld.cc +++ b/sql/mysqld.cc @@ -2759,6 +2759,7 @@ void unlink_thd(THD *thd) thd->add_status_to_global(); +my_sleep(2000000); mysql_mutex_lock(&LOCK_thread_count); thd->unlink(); /* CREATE TABLE t1 (a INT PRIMARY KEY); connect (c1,localhost,root,,); INSERT INTO t1 VALUES (1); INSERT INTO t1 VALUES (2); INSERT INTO t1 VALUES (3); --let $cid= `SELECT CONNECTION_ID()` --connection default eval KILL CONNECTION $cid; --sleep 1 SHOW GLOBAL STATUS LIKE 'Com_insert'; --sleep 2 SHOW GLOBAL STATUS LIKE 'Com_insert'; DROP TABLE t1; The first SHOW GLOBAL STATUS counts the thread twice: CREATE TABLE t1 (a INT PRIMARY KEY); INSERT INTO t1 VALUES (1); INSERT INTO t1 VALUES (2); INSERT INTO t1 VALUES (3); KILL CONNECTION 4; SHOW GLOBAL STATUS LIKE 'Com_insert'; Variable_name Value Com_insert 6 SHOW GLOBAL STATUS LIKE 'Com_insert'; Variable_name Value Com_insert 3 DROP TABLE t1;
            Hide
            knielsen Kristian Nielsen added a comment -

            The race occurs because there is a window of time between a thread does THD::add_status_to_global() and THD::unlink(). During this window, calc_sum_of_all_status() can run and count the contribution of the thread twice.

            Show
            knielsen Kristian Nielsen added a comment - The race occurs because there is a window of time between a thread does THD::add_status_to_global() and THD::unlink(). During this window, calc_sum_of_all_status() can run and count the contribution of the thread twice.
            Hide
            knielsen Kristian Nielsen added a comment -

            The original buildbot test failure can be reproduced easily by injecting the
            following sleep in the code:

            diff --git a/sql/slave.cc b/sql/slave.cc
            index 4129c4c..32e6bd4 100644
            --- a/sql/slave.cc
            +++ b/sql/slave.cc
            @@ -4837,6 +4837,7 @@ log '%s' at position %s, relay log '%s' position: %s%s", RPL_LOG_NAME,
               mysql_cond_broadcast(&rli->stop_cond);
               DBUG_EXECUTE_IF("simulate_slave_delay_at_terminate_bug38694", sleep(5););
               mysql_mutex_unlock(&rli->run_lock);  // tell the world we are done
            +my_sleep(500000);
             
               /*
                 Deactivate the parallel replication thread pool, if there are now no more
            
            Show
            knielsen Kristian Nielsen added a comment - The original buildbot test failure can be reproduced easily by injecting the following sleep in the code: diff --git a/sql/slave.cc b/sql/slave.cc index 4129c4c..32e6bd4 100644 --- a/sql/slave.cc +++ b/sql/slave.cc @@ -4837,6 +4837,7 @@ log '%s' at position %s, relay log '%s' position: %s%s", RPL_LOG_NAME, mysql_cond_broadcast(&rli->stop_cond); DBUG_EXECUTE_IF("simulate_slave_delay_at_terminate_bug38694", sleep(5);); mysql_mutex_unlock(&rli->run_lock); // tell the world we are done +my_sleep(500000); /* Deactivate the parallel replication thread pool, if there are now no more
            Hide
            knielsen Kristian Nielsen added a comment -

            Monty said he would fix this during this week

            Show
            knielsen Kristian Nielsen added a comment - Monty said he would fix this during this week
            Hide
            knielsen Kristian Nielsen added a comment -

            Monty, I've pushed a temporary work-around for the test failure in rpl.rpl_000011:

            http://lists.askmonty.org/pipermail/commits/2015-June/008039.html

            Please remove that work-around from the testcase when you push the fix for this underlying issue.

            Show
            knielsen Kristian Nielsen added a comment - Monty, I've pushed a temporary work-around for the test failure in rpl.rpl_000011: http://lists.askmonty.org/pipermail/commits/2015-June/008039.html Please remove that work-around from the testcase when you push the fix for this underlying issue.
            Hide
            monty Michael Widenius added a comment -

            Fixed by adding a marker if we have added the thread counters to the global counters.
            Fix will be pushed into 5.5 tree

            Show
            monty Michael Widenius added a comment - Fixed by adding a marker if we have added the thread counters to the global counters. Fix will be pushed into 5.5 tree
            Hide
            monty Michael Widenius added a comment -

            Fix pushed into 5.5
            When merged to 10.0, the above temporary fix should be removed and this bug closed.

            Show
            monty Michael Widenius added a comment - Fix pushed into 5.5 When merged to 10.0, the above temporary fix should be removed and this bug closed.

              People

              • Assignee:
                monty Michael Widenius
                Reporter:
                knielsen Kristian Nielsen
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:

                  Agile