Uploaded image for project: 'MariaDB Server'
  1. MariaDB Server
  2. MDEV-6582

DEBUG_SYNC does not reset mysys_var->current_mutex, causes assertion "Trying to unlock mutex that wasn't locked"

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.0.13
    • Fix Version/s: 10.0.16
    • Component/s: Tests
    • Labels:
      None

      Description

      https://buildbot.askmonty.org/buildbot/builders/work-amd64-valgrind/builds/5937

      rpl_parallel.test asserts with the following stack trace:

      safe_mutex: Trying to unlock mutex LOCK_wait_commit that wasn't locked at /var/lib/buildbot/maria-slave/work-opensuse-amd64/build/sql/sql_class.cc, line 1817
      Last used at /var/lib/buildbot/maria-slave/work-opensuse-amd64/build/sql/log.cc, line: 6791
      
      mysys/thr_mutex.c:424(safe_mutex_unlock)[0xe1c6f6]
      sql/sql_class.cc:1817(THD::awake(killed_state))[0x5b88af]
      psi/mysql_thread.h:756(inline_mysql_mutex_unlock)[0x5e59e9]
      sql/sql_parse.cc:7453(sql_kill)[0x5e5a52]
      sql/sql_parse.cc:4313(mysql_execute_command(THD*))[0x5e9203]
      sql/sql_parse.cc:6411(mysql_parse(THD*, char*, unsigned int, Parser_state*))[0x5ebf5a]
      sql/sql_parse.cc:1309(dispatch_command(enum_server_command, THD*, char*, unsigned int))[0x5ed96c]
      sql/sql_parse.cc:1005(do_command(THD*))[0x5ef152]
      sql/sql_connect.cc:1379(do_handle_one_connection(THD*))[0x6dd5ff]
      sql/sql_connect.cc:1295(handle_one_connection)[0x6dd725]
      

      I think I have seen similar failures before. I guess it happens because
      mysys_var->current_mutex changes while THD::awake() is trying to kill a
      thread. IIRC, this is related to incorrect use of ENTER_COND and EXIT_COND
      (eg. missing exit or something like that).

      I also think that in THD::awake, we should take a local copy of
      mysys_var->current_mutex, we really don't want to reload the value when we
      unlock the mutex.

      But we could add a debug assertion that the value does not change, to try and
      catch more of these problems (incorrect usage of ENTER_COND/EXIT_COND).

        Gliffy Diagrams

          Attachments

            Issue Links

              Activity

              Show
              knielsen Kristian Nielsen added a comment - I found another instance of this: http://buildbot.askmonty.org/buildbot/builders/kvm-dgcov-jaunty-i386/builds/4408
              Hide
              knielsen Kristian Nielsen added a comment -

              Another way this bug can occur is if DEBUG_SYNC is used inside ENTER_COND / EXIT_COND. Because DEBUG_SYNC changes current_mutex, which can race to give the problem seen here.

              Show
              knielsen Kristian Nielsen added a comment - Another way this bug can occur is if DEBUG_SYNC is used inside ENTER_COND / EXIT_COND. Because DEBUG_SYNC changes current_mutex, which can race to give the problem seen here.
              Hide
              knielsen Kristian Nielsen added a comment -

              I think I found it.

              I managed to reproduce the bug reliably by inserting these sleeps in the code:

              === modified file 'sql/log.cc'
              --- sql/log.cc	2014-11-17 07:53:42 +0000
              +++ sql/log.cc	2014-11-26 08:59:16 +0000
              @@ -6833,6 +6833,7 @@ MYSQL_BIN_LOG::queue_for_group_commit(gr
                     */
                     wfc->opaque_pointer= orig_entry;
                     DEBUG_SYNC(orig_entry->thd, "group_commit_waiting_for_prior");
              +my_sleep(100000);
                     orig_entry->thd->ENTER_COND(&wfc->COND_wait_commit,
                                                 &wfc->LOCK_wait_commit,
                                                 &stage_waiting_for_prior_transaction_to_commit,
              
              === modified file 'sql/sql_class.cc'
              --- sql/sql_class.cc	2014-11-13 10:01:31 +0000
              +++ sql/sql_class.cc	2014-11-26 08:59:05 +0000
              @@ -1810,6 +1810,7 @@ void THD::awake(killed_state state_to_se
                     for (i= 0; i < WAIT_FOR_KILL_TRY_TIMES * SECONDS_TO_WAIT_FOR_KILL; i++)
                     {
                       int ret= mysql_mutex_trylock(mysys_var->current_mutex);
              +my_sleep(200000);
                       mysql_cond_broadcast(mysys_var->current_cond);
                       if (!ret)
                       {
              
              

              I think the problem is in debug_sync_execute():

                    if (thd->mysys_var)
                    {
                      old_mutex= thd->mysys_var->current_mutex;
                      thd->mysys_var->current_mutex= &debug_sync_global.ds_mutex;
              ...
                    if (old_mutex)
                      thd->mysys_var->current_mutex= old_mutex;
              

              Note, that if old_mutex is NULL, then debug_sync_execute() will leave a
              dangling non-NULL value in thd->mysys_var->current_mutex. This allows a race,
              where first THD::awake() fetches and locks the dangling mutex, then ENTER_COND
              changes current_mutex to a new value, then THD::awake() reloads the new value
              and unlocks a different mutex from the one it locked (the dangling one).

              Seems clear enough now.

              Show
              knielsen Kristian Nielsen added a comment - I think I found it. I managed to reproduce the bug reliably by inserting these sleeps in the code: === modified file 'sql/log.cc' --- sql/log.cc 2014-11-17 07:53:42 +0000 +++ sql/log.cc 2014-11-26 08:59:16 +0000 @@ -6833,6 +6833,7 @@ MYSQL_BIN_LOG::queue_for_group_commit(gr */ wfc->opaque_pointer= orig_entry; DEBUG_SYNC(orig_entry->thd, "group_commit_waiting_for_prior"); +my_sleep(100000); orig_entry->thd->ENTER_COND(&wfc->COND_wait_commit, &wfc->LOCK_wait_commit, &stage_waiting_for_prior_transaction_to_commit, === modified file 'sql/sql_class.cc' --- sql/sql_class.cc 2014-11-13 10:01:31 +0000 +++ sql/sql_class.cc 2014-11-26 08:59:05 +0000 @@ -1810,6 +1810,7 @@ void THD::awake(killed_state state_to_se for (i= 0; i < WAIT_FOR_KILL_TRY_TIMES * SECONDS_TO_WAIT_FOR_KILL; i++) { int ret= mysql_mutex_trylock(mysys_var->current_mutex); +my_sleep(200000); mysql_cond_broadcast(mysys_var->current_cond); if (!ret) { I think the problem is in debug_sync_execute(): if (thd->mysys_var) { old_mutex= thd->mysys_var->current_mutex; thd->mysys_var->current_mutex= &debug_sync_global.ds_mutex; ... if (old_mutex) thd->mysys_var->current_mutex= old_mutex; Note, that if old_mutex is NULL, then debug_sync_execute() will leave a dangling non-NULL value in thd->mysys_var->current_mutex. This allows a race, where first THD::awake() fetches and locks the dangling mutex, then ENTER_COND changes current_mutex to a new value, then THD::awake() reloads the new value and unlocks a different mutex from the one it locked (the dangling one). Seems clear enough now.
              Show
              knielsen Kristian Nielsen added a comment - http://lists.askmonty.org/pipermail/commits/2014-November/007071.html

                People

                • Assignee:
                  knielsen Kristian Nielsen
                  Reporter:
                  knielsen Kristian Nielsen
                • Votes:
                  0 Vote for this issue
                  Watchers:
                  1 Start watching this issue

                  Dates

                  • Created:
                    Updated:
                    Resolved: