  MariaDB Server
  MDEV-6751

possible deadlock with ALTER + threadpool + events

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 10.0.13
    • Fix Version/s: 10.0.14
    • Labels:
      None
    • Environment:
      Ubuntu 12.04 LTS Linux db1062 3.2.0-60-generic #91-Ubuntu SMP
      Server version: 10.0.13-MariaDB-log Source distribution

      Description

      Am running the following on a slave:

      • Largish (24h, 600M rows, 200G) ALTER TABLE
      • Events with INFORMATION_SCHEMA queries
      • Threadpool pool-of-threads active
      • Replication active
      • No other significant traffic
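
      For reference, the ALTER was roughly of the following shape. This is a purely hypothetical illustration (table, column, and index names are made up); the real statement is in the attached db1062_lockup.ALTER.txt and may differ:

        -- Hypothetical example only; see db1062_lockup.ALTER.txt for the real one.
        ALTER TABLE big_table
          ADD COLUMN new_flag TINYINT UNSIGNED NOT NULL DEFAULT 0,
          ADD INDEX ix_big_table_new_flag (new_flag);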

      After several hours, MariaDB locks up with no CPU or disk activity, and no response on existing or new connections via port, extra_port, or socket.

      Attached are gdb backtraces for two occurrences, examples of the ALTER and the INFORMATION_SCHEMA activity, and other info. Would appreciate any insight from devs to identify the deadlock, and to narrow down the variables for a test case that isn't 200G.

      Am presently trialing the ALTER outside the threadpool using the extra_port, with all other settings unchanged.
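
      Connections to extra_port are served with the old one-thread-per-connection method even when thread_handling=pool-of-threads, which is what makes this experiment meaningful. A minimal sketch of the relevant settings, with placeholder values (not the ones from this server):

        -- my.cnf (placeholder values):
        --   [mysqld]
        --   thread_handling       = pool-of-threads
        --   extra_port            = 3307
        --   extra_max_connections = 10

        -- Verify at runtime; then connect with mysql --port=3307 to bypass the pool:
        SHOW GLOBAL VARIABLES LIKE 'extra_port';
        SHOW GLOBAL VARIABLES LIKE 'thread_handling';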

      Other notes:

      • It doesn't seem to be a thread pool overload, as there aren't enough threads in the backtrace.
      • The INFORMATION_SCHEMA event traffic uses GET_LOCK to serialize some activity and prevent pile-up.
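
      The GET_LOCK serialization looks roughly like the sketch below. The event name, lock name, query, and destination table here are hypothetical (the real query is in db1062_lockup.i_s_query.txt); the non-blocking GET_LOCK / RELEASE_LOCK shape is the point:

        DELIMITER //
        CREATE EVENT ev_schema_stats
          ON SCHEDULE EVERY 1 MINUTE
        DO
        BEGIN
          -- Skip this run if a previous one still holds the lock, so slow
          -- INFORMATION_SCHEMA scans cannot pile up on top of each other.
          IF GET_LOCK('ev_schema_stats', 0) = 1 THEN
            -- monitor.schema_sizes is a hypothetical destination table.
            INSERT INTO monitor.schema_sizes (snapshot_at, table_schema, bytes)
            SELECT NOW(), table_schema, SUM(data_length + index_length)
              FROM information_schema.TABLES
             GROUP BY table_schema;
            DO RELEASE_LOCK('ev_schema_stats');
          END IF;
        END //
        DELIMITER ;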

          Attachments

          1. db1062_lockup_gdb.2.log (172 kB)
          2. db1062_lockup_gdb.3.log (183 kB)
          3. db1062_lockup_gdb.log (162 kB)
          4. db1062_lockup.ALTER.txt (4 kB)
          5. db1062_lockup.cmake.txt (0.6 kB)
          6. db1062_lockup.global_vars.txt (325 kB)
          7. db1062_lockup.i_s_query.txt (0.4 kB)
          8. db1062_lockup.processlist.txt (13 kB)

            Activity

            Sean Pringle added a comment -

            Using extra_port for the ALTER, the server still locked up, but I managed to catch it in action. See the attached processlist and gdb trace #3. Similar to https://mariadb.atlassian.net/browse/MDEV-5551?

            The client connections in Command="Killed" and State="Cleaning up" successfully connected according to client logs, and apparently got past authentication, but the first query always failed with "Got timeout reading communication packets".

            Subsequently the client did not show open TCP connections hanging around in lsof -i tcp output. However, the server did continue to show open TCP connections in lsof, and did not report anything in the error log with log_warnings=2.

            Notice the REPLACE INTO `heartbeat` query in state Update for many seconds; that is Percona Toolkit's pt-heartbeat writing to an InnoDB table, a write that is usually instant. Attempting SHOW ENGINE INNODB STATUS to investigate simply hung indefinitely.
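
            For context, the stuck heartbeat write is normally a trivial single-row statement against a tiny table, roughly of this form (a hypothetical approximation; the exact table and columns depend on the pt-heartbeat options in use):

              -- Approximation of a pt-heartbeat write; normally this completes in
              -- milliseconds, which is what made the long "Update" state suspicious.
              REPLACE INTO percona.heartbeat (ts, server_id)
              VALUES (UTC_TIMESTAMP(), @@server_id);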

            Elena Stepanova added a comment -

            Hi,

            What happened to the ALTER query? Who or what killed it, and was it before or after you identified the situation as a deadlock?

            Thanks.

            Sean Pringle added a comment -

            In the example processlist, I killed the ALTER after everything else had already become Command=Killed without any known intervention by a human or tool. It didn't help.

            Elena Stepanova added a comment -

            Jan Lindström,

            Could you please check whether it has something in common with other problems reported or observed, or whether it's a new one?

            Sean Pringle added a comment - edited

            Tried disabling the thread pool, replication, and other traffic, all to no effect; the problem recurred. After each restart I waited until transactions had completed rollback.

            Observing that the stuck threads were in buf_*, mtr_*, and log_* calls, I tried reverting to a single buffer pool instance, innodb_buffer_pool_instances=1 (which matches our 5.5 config). That allowed the ALTER to complete normally without a hiccup.
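
            For anyone wanting to try the same workaround: innodb_buffer_pool_instances is not a dynamic variable, so it has to be changed in the config file and the server restarted. A minimal sketch:

              -- In my.cnf, then restart mysqld (the variable cannot be set at runtime):
              --   [mysqld]
              --   innodb_buffer_pool_instances = 1

              -- Verify after the restart:
              SHOW GLOBAL VARIABLES LIKE 'innodb_buffer_pool_instances';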

            A flushing lock-cycle of some sort?

            Elena Stepanova added a comment -

            If you can experiment with your instance, could you maybe try setting innodb_buffer_pool_instances to 1?
            We are currently investigating a problem which looks related to multiple buffer pool instances. I am not sure at all that your deadlock is the same issue, but it won't hurt to try.

            Jan Lindström added a comment -

            Do you see long semaphore wait warnings/errors in the error log?

            Sean Pringle added a comment -

            @Elena, yes innodb_buffer_pool_instances=1 helped. See comment above.

            @Jan, no long semaphore wait warnings, and no 600-second abort.

            Elena Stepanova added a comment -

            "@Elena, yes innodb_buffer_pool_instances=1 helped. See comment above."

            Sorry, I somehow missed that part of the comment. Thanks.


              People

              • Assignee: Jan Lindström
              • Reporter: Sean Pringle
              • Votes: 0
              • Watchers: 5
